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ABSTRACT 


In  this  paper  we  discuss  several  recent  conjufate-fradient  type  methods  for 
solving  large-scale  nonlinear  optimisation  problems.  We  demonstrate  how  the 
performance  of  these  methods  can  be  significantly  improved  by  careful  implemen¬ 
tation.  A  method  based  upon  iterative  preconditioning  will  be  suggested  which 
performs  reasonably  efficiently  on  a  wide  variety  of  significant  test  problems. 

Our  results  indicate  that  nonlinear  conjugate-gradient  methods  behave  in  a 
similar  way  to  conjugate-gradient  methods  for  the  solution  of  systems  of  linear 
equations.  These  methods  work  best  on  problems  whose  Hessian  matrices  have  sets 
of  clustered  eigenvalues.  On  more  general  problems,  however,  even  the  best  method 
may  require  a  prohibitively  large  number  of  iterations.  We  present  numerical 
evidence  that  indicates  that  the  use  of  theoretical  analysis  to  predict  the  perfor¬ 
mance  of  algorithms  on  general  problems  is  not  straightforward.  A 
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1.  Introduction 


This  report  describee  the  progress  of  •  search  for  an  efficient  algorithm  to  find 
the  minimum  of  a  general  nonlinear  function  subject  to  upper  and  lower  bounds 
upon  the  variables.  In  particular  we  are  interested  in  solving  problems  for  which 
the  number  of  variables  n  is  very  large,  say  of  the  order  of  several  hundred. 

Our  interest  in  the  solution  of  bound-constrained  problems  as  opposed  to 
unconstrained  problems  steins  from  two  observations.  Firstly,  it  is  rare  for  there 
to  be  no  constraints  at  all  upon  the  individual  variables.  Secondly,  even  if  the 
solution  does  not  lie  upon  any  of  the  bounds,  their  presence  can  prevent  the  func¬ 
tion  from  being  evaluated  at  unreasonable  or  nonsensical  points.  Intuitively,  one 
would  expect  that  it  would  be  possible  to  construct  a  more  efficient  algorithm  by 
providing  more  information  about  the  region  in  which  the  solution  i6  expected  to 
lie.  For  some  classes  of  algorithm  this  is  indeed  the  case,  but  we  shall  show  in  later 
sections  that  the  presence  of  bounds  upon  the  variables  may  adversely  affect  the 
performance  of  conjugate-gradient  methods. 

The  most  successful  algorithms  for  bound-constrained  minimization  proceed 
as  follows.  At  any  stage  of  the  algorithm  the  variables  are  partitioned  into  two 
seta:  the  set  of  fixed  variables  which  are  at  their  upper  or  lower  bounds,  and 
the  set  of  free  variables  which  are  currently  being  optimized.  An  unconstrained 
minimisation  is  performed  with  respect  to  the  free  variables.  This  unconstrained 
problem  is  altered  occasionally  if  a  free  variable  violates  a  bound  or  a  fixed  variable 
is  allowed  to  become  free.  (F or  the  precise  details  of  how  the  free  variables  are 
selected,  see  Gill  and  Murray,  1976).  Clearly,  bound-constrained  minimization  is 
closely  related  to  unconstrained  minimisation  and  this  is  reflected  in  the  content 
of  this  paper. 

Probably  the  most  commonly-used  techniques  for  minimising  a  general  un¬ 
constrained  nonlinear  function  are  the  class  of  quasi-Newton  methods  (see  Dennis 
and  Mor4,  1977,  for  a  survey).  However,  such  methods  require  the  storage  of 
an  n  X  n  approximate  Hessian  matrix  of  second  derivatives  and  as  n  becomes 
large  these  methods  become  impractical.  The  first  algorithm  that  could  be  applied 
specifically  to  large-scale  unconstrained  optimisation  was  due  to  Fletcher  and 
Reeves  (1964),  their  algorithm  being  a  generalisation  of  the  Hestenes  and  Stiefel 
conjugate-gradient  method  for  solving  the  linear  equations  Ax  ^  b  for  a  positive- 
definite  symmetric  n  X  n  matrix  A.  The  Hestenes  and  Stiefel  algorithm  is  iterative 
and  if  no  rounding  error  is  made,  requires  n  iterations  or  fewer  to  find  a  solution. 
A  fundamental  advantage  of  the  method  is  that  no  matrix  storage  is  required  over 
and  above  the  storage  of  the  problem  itself. 

The  work  of  Hestenes  and  Stiefel  was  motivated  in  part  by  a  misconception 
that  prevailed  in  the  early  1950's  concerning  the  numerical  stability  of  direct 


methods  for  solving  linear  equations.  It  was  believed  that  the  rounding  error 
involved  in  solving  even  small  systems  of  equations  would  always  prevent  an 
accurate  answer  from  being  found.  It  was  expected  that  an  iterative  procedure 
such  as  the  conjugate-gradient  method  would  prove  to  be  inherently  more  stable 
because  an  inaccurate  iterate  would  automatically  be  refined  in  subsequent  itera¬ 
tions.  Ironically  the  opposite  is  true;  the  conjugate-gradient  method  is  far  more 
susceptible  to  rounding  error  than  a  typical  direct  method.  Moreover,  rounding 
error  may  cause  the  algorithm  to  require  many  more  than  n  iterations  to  find  the 
solution.  This  feature  of  the  algorithm  is  particularly  disappointing,  especially 
when  it  can  be  argued  that  even  n  is  an  excessive  number  of  iterations  for  large 
problems.  However,  even  when  finite- precision  arithmetic  is  being  used,  there 
are  linear  equations  which  can  be  solved  in  far  fewer  than  n  iterations.  These 
problems  tend  to  have  coefficient  matrices  whose  eigenvalues  are  clustered  into 
seta  containing  eigenvalues  of  similar  magnitude. 

The  development  of  conjugate-gradient  methods  for  optimization  closely  paral¬ 
leled  that  of  conjugate-gradient  methods  for  linear  equations  in  the  sense  that 
initial  enthusiasm  for  the  method  was  quickly  dispelled  by  disappointing  numerical 
performance.  Extensive  testing  during  the  late  1960’s  and  early  1970'b  showed 
that  the  Flelcher-Reeves  algorithm  was  generally  inferior  to  the  best  of  the  alter¬ 
native  methods  whenever  the  storage  of  an  n  X  n  matrix  was  not  an  impediment 
to  the  application  of  alternative  methods.  During  the  last  few  years  there  have 
been  significant  improvements  in  the  design  of  conjugate-gradient  algorithms  (we 
shall  discuss  some  of  these  improvements  in  this  paper),  but  in  many  cases  these 
have  been  more  than  outweighed  by  improvements  in  the  rival  techniques.  In  par¬ 
ticular  there  has  been  considerable  interest  in  modified-Newton  and  quasi-Newton 
methods  for  nonlinear  problems  with  sparse  Hessian  matrices  (Gill  and  Murray, 
1973;  Curtis,  Powell  and  Reid,  1974;  Toint,  1977). 

However,  there  is  a  class  of  unconstrained  problems  for  which  conjugate- 
gradient  methods  are  currently  the  only  techniques  that  can  be  applied.  This  is 
the  class  of  problems  for  which  the  Hessian  matrix  is  very  large,  but  not  sparse. 
Such  problems  arise  in  large-scale  linearly-constrained  and  nonlincarly-constrained 
minimization  (see  Gill  and  Murray,  1974b;  Murray  and  Wright,  1978).  In  this 
situation  the  Hessian  matrices  are  of  the  form  ZTGZ  where  G  and  Z  arc  large 
matrices  which  may  be  sparse,  but  whose  product  ZrGZ  is  large  and  dense. 

In  Section  2  we  discuss  the  Fletcher-Reeves  conjugate-gradient  method  and  its 
recent  improvements  by  Beale  and  others.  We  demonstrate  how  the  performance 
of  these  algorithms  can  be  significantly  improved  by  careful  implementation.  In 
Section  3  we  consider  the  class  of  limited-memory  quasi-Newton  methods  suggested 
by  Perry  (1977)  and  Shanno  (1978a).  This  is  followed  by  a  discussion  of  methods 
which  attempt  at  each  iteration  to  precondition  the  problem  to  that  the  rate  of 
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convergence  of  conjugate-gradient  end  limited- memory  quasi-Newton  methods  can 
be  improved.  A  diagonal  preconditioning  technique  is  suggested  which  can  be  used 
improve  the  performance  of  moat  conjugate-gradient  type  methods. 

Finally,  in  Section  7  we  describe  some  extensive  numerical  tests  which  indicate 
that  nonlinear  conjugate-gradient  methods  behave  very  similarly  to  conjugate- 
gradient  methods  for  linear  equations.  Problems  whose  Hessian  matrices  at  the 
solution  contain  sets  of  clustered  eigenvalues  may  be  minimized  in  significantly 
fewer  than  n  iterations.  Problems  without  this  property  may  require  anything 
from  between  n  and  5n  iterations,  with  approximately  2n  iterations  or  fewer  being 
a  common  figure  for  moderately  difficult  problems.  The  numerical  results  sug¬ 
gest  that  a  preconditioning  based  upon  a  diagonal  scaling  may  lead  to  significant 
improvements  in  performance  on  general  problems. 


2.  Tbs  traditional  conjugate  gradient  method  and  its  modifications 
2.1  The  Fletcher-Reeves  algorithm 


The  Fletcher-Reeves  conjugate-gradient  algorithm  for  minimizing  a  general 
nonlinear  function  F(x)  proceeds  as  follows.  Let  z%  be  a  given  starting  point  and 
let  k  denote  the  current  iteration,  starting  with  k  —  0.  The  iteration  requires  gk, 
the  gradient  vector  VF(x)  evaluated  at  x*,  the  fc-th  estimate  of  the  minimum.  At 
each  iteration  a  vector  (known  as  the  direction  of  search  )  is  computed  and 
the  new  estimate  x*+i  is  given  by  xs  +  a*pk  where  a*  (the  step  length )  minimises 
the  function  F(x*  +  a*ps)  with  respect  to  the  scalar  a*.  During  the  first  iteration, 
pa  is  just  the  steepest-descent  direction  — g(x o).  On  completion  of  the  fc-th  linear 
minimisation,  the  direction  of  search  for  the  next  iteration  is  found  from  the 
formula 

PM- 1  ■»  —  JM-i  +  (2-1) 


where 


(2.2) 


and  |H|,  denotes  the  Euclidean  norm  of  a  vector  u. 

When  we  are  minimising  a  quadratic  function  F(x)  —  crx  -f-  \xtQi  with  Q  a 
symmetric  positive-definite  matrix  and  c  an  rs-vector,  the  directions  obtained  from 
the  Fletcher-Reeves  algorithm  are  identical  to  those  of  the  Hestenes  and  Stiefel 
conjugate-gradient  method  for  solving  the  linear  equations  Qx  —  — c  (Hestenes 
and  Stiefel,  1952).  In  the  quadratic  case  the  step  length  a*  can  be  computed  in 
closed  form  as  a*  —  —gkPkJplQpk,  moreover  it  can  be  shown  (see,  for  example, 


A 


Fletcher,  1972)  that  the  directions  of  search  are  mutually  conjugate,  i.e. 

P?QP}  —  °.  »  7*  J< 

and  that  the  set  of  gradient  vectors  {9,}  arc  mutually  orthogonal,  i.e. 

»  /  >• 

It  is  relatively  easy  to  demonstrate  that  the  conjugacy  of  the  set  of  directions 
gives  n-step  termination  on  quadratic  functions.  If  we  make  the  transformation 
x  —  Pu,  where  P  is  the  matrix  with  columns  equal  to  the  directions  po,pi,. .., 
Pn—i,  then  the  transformed  quadratic  function  ^(u)  <=  P(Pu)  is  separable  in  the 
variables  u  and  can  be  minimised  by  successively  minimising  d  with  respect  to 
each  of  the  variables  u,  in  turn. 

The  formula  for  fa  used  by  Fletcher  and  Reeves  was  one  of  several  suggested 
by  Hestenes  and  Stiefel.  Theoretically  these  formulae  are  equivalent  for  a  quad* 
ratio  function  and  Hestenes  and  Stiefel  sought  to  choose  a  formula  with  the  best 
properties  when  the  computation  is  subject  to  rounding  error.  The  first  step  in 
the  derivation  of  all  the  formulae  for  fa  is  the  recognition  that  the  concept  of 
conjugtcy  may  be  replaced  by  one  of  orthogonuHty.  For  a  quadratic  function 
the  vector  14,  which  is  defined  as  the  difference  between  the  gradients  at  any  two 
iterates  x^^-i  and  X*,  is  given  by 

Vk  —  9k+\  —  9k  “*  1  —  **)  —  OkQpk- 

Consequently  the  conjugacy  condition  pjQPj  ■»  0  is  equivaleut  to  the  orthogonality 
condition  y[p}  ■■  0.  The  most  obvious  formula  for  fa  follows  from  pre-multiplying 
(2.1)  by  14  and  choosing  fa  such  that  y*p*+i  “  0,  i.e. 

A”  ylWi/l/IPfc-  (23) 

If  we  make  use  of  the  fact  that  pk  and  glpk—i  vanish  if  an  exact  linear  search 
is  made  then 

vl?k  —  slpk 

**  — JflX — 9k  +  fa-tPk-l) 

-  Halil 

This  leads  to  the  formula  given  by  Polak  and  Ribitre  (see  Polak,  1971): 

fa  —  vlfe-fi/llftll]'  (2-4) 
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Finally,  since  the  gradients  are  mutually  orthogonal  for  a  quadratic  function  we 
have  y*jfc+i  equal  to  ||sk+i||j,  giving  (2.2). 

For  general  nonlinear  functions  the  alternative  formulae  for  fa  arc  no  longer 
equivalent.  We  prefer  to  use  (2.3)  since  {4  is  orthogonal  to  Pk+\  irrespective  of 
the  accuracy  of  the  linear  search  or  any  possible  non-quadratic  behavior  of  the 
objective  function.  Moreover,  Powell  (1977)  has  shown  that  the  use  of  (2.2)  may 
cause  slow  convergence  in  the  general  nonlinear  case  when  exact  linear  searches 
are  made.  The  orthogonality  of  y*  to  pk+1  in  the  absence  of  an  exact  linear  search 
is  particularly  important  because  it  allows  the  traditional  conjugate-gradient  al¬ 
gorithm  to  be  generalised  to  perform  inexact  linear  searches.  The  ability  to  use 
inexact  linear  searches  is  a  prerequisite  for  any  algorithm  designed  to  minimize 
bound-constrained  problems. 

The  finite  termination  property  of  conjugate-gradient  methods  on  quadratic 
functions  motivated  Fletcher  and  Reeves  to  abandon  the  use  of  (2.1)  after  a  cycle 
of  n  linear  searches  and  set  as  the  steepest-descent  direction,  — 9fc+|.  This 
strategy  is  known  as  restarting  or  resetting  .  Restarting  with  the  steepest-descent 
direction  is  based  upon  the  questionable  assumption  that  the  reduction  in  F(x) 
along  the  restart  direction  will  be  greater  than  that  obtained  if  the  usual  formula 
were  used. 

We  shall  refer  to  the  conjugate-gradient  algorithm  that  uses  (2.3)  for  the 
definition  of  fa  and  restarts  with  the  steepest-descent  direction  every  n  iterations 
as  the  tradition *1  conjugate-gradient  method. 


2.2  The  traditional  conjugate-gradient  method  with  inexact  linear  searches 

The  difficulty  and  cost  of  finding  the  exact  minimum  of  F[x)  along  pk  have 
resulted  in  many  implementations  of  the  traditional  conjugate-gradient  method 
allowing  values  of  o*  which  do  not  necessarily  give  a  zero  directional  derivative 
9(x*  -f"  OkPk)TPk-  For  example,  in  the  implementation  of  Fletcher  and  Reeves  a* 
is  computed  by  taking  increasing  multiples  of  a  scalar  until  a  point  is  reached 
with  a  positive  directional  derivative.  This  point  and  the  latest  point  at  which  the 
directional  derivative  is  negative  are  then  used  as  a  basis  for  cubic  interpolation. 
This  interpolation  is  continued  until  a  point  is  obtained  at  which  the  function 
value  is  less  than  F(x«). 

For  a  quasi-Newton  method  there  are  essentially  two  conditions  that  must  be 
satisfied  if  convergence  is  to  be  guaranteed.  The  first  condition  is  that  the  function 
F[x)  must  be  sufficiently  reduced  at  each  iteration.  The  second  condition  is  that 
the  search  direction  must  not  become  arbitrarily  close  to  being  orthogonal  to  the 
steepest-descent  direction;  this  is  equivalent  to  requiring  that  — J^Pk/IISkltallPklh 


is  greater  than  a  constant  that  is  bounded  away  from  sero. 

The  first  condition  is  satisfied  for  any  a*  computed  by  the  following  step- 
length  algorithm. 

Step-length  algorithm  QNSL 

Let  {aJ)  define  a  sequence  of  points  that  tend  in  the  limit  to  the  minimum 
of  F(x)  along  p *.  (If  F(x)  is  smooth  this  sequence  can  be  computed  by  means  of 
some  safeguarded  polynomial  interpolation  algorithm.)  Let  t  be  the  index  of  the 
first  member  of  this  sequence  such  that 

1 9[*k  +  o‘Pk)TPkl  <  ~V9TkPk,  (cl) 

where  rj  (0  <  r;  <  1)  is  some  constant  scalar.  Let  p  (0  <  p  <  $)  be  another 
constant  scalar.  Find  the  smallest  non-negative  integer  r  such  that 

Fk  -  F{xt  +  J-'oV*)  £  -2-Wpglpt  (c2) 

and  set  a*  *=-  2 ~tai.  g 

If  at  is  computed  according  to  this  rule  it  can  be  shown  (Gill  and  Murray, 
1973)  that 

F. -F<*  +  ««>  >♦(=£-).  PS) 

where  ^  is  a  function  such  that,  for  any  sequence  (c*), 

lim  ^(efc)  -■  0  implies  lim  c*  —  0.  (2.6) 

k—*oo  k—+oo 


An  important  property  of  conditions  (cl)  and  (c2)  is  that  if  p  is  chosen  as  a  small 
value  (say  10-4)  then,  unless  F{x)  is  a  pathologically  ill-behaved  function,  any 
value  o1  satisfying  (cl)  automatically  satisfies  (c2)  with  r  ■■  0.  In  this  case  the 
step-length  algorithm  reduces  to  finding  a  scalar  a*  such  that 

|y(*s  +  otpt)TptH  <  — q$*  p*- 

The  value  of  rj  can  be  specified  by  the  user  and  can  be  used  to  give  a  step  length 
that  is  well-suited  to  the  problem  being  solved.  If  q  is  chosen  as  0.9  the  algorithm 
will  generally  compute  a  “crude"  value  of  a*,  provided  it  satisfies  (c2).  This  value 
is  usually  o°,  the  value  used  to  start  the  iterative  method  that  computes  the  min¬ 
imum  of  F[x)  along  p*.  If  q  is  chosen  as  sero,  o*  will  be  the  point  that  minimizes 
F[x)  along  p* 


It  is  important  to  note  that  a  step-length  algorithm  based  upon  the  inequalities 
\g(xk  +  akPk)TPk\  <  —wlpk  and  Fk  —  Fk+\  >  —nakg[pk 

for  fixed  values  of  p  and  r)  is  not  satisfactory  since  no  member  of  the  sequence 
{oJ}  may  satisfy  these  inequalities  simultaneously. 

It  is  a  feature  of  both  Newton-type  and  quasi-Newton  methods  that,  as  the 
iterates  approach  the  solution,  the  unit  6tep  will  invariably  satisfy  both  the  con¬ 
ditions  (cl)  and  (c2).  This  property  is  useful  because  o°  =*  I  can  be  used  to  initiate 
the  sequence  {oJ}.  If  by  chance  unity  is  not  an  acceptable  value  for  ak,  the  next  in¬ 
terpolated  value  generally  will  be.  This  property  of  quasi-Newton  methods  implies 
that  it  usually  requires  an  average  of  between  one  and  two  function  evaluations 
only  per  iteration  to  obtain  overall  convergence. 

If  we  are  to  apply  a  similar  step-length  algorithm  to  conjugate-gradient 
methods,  it  is  necessary  to  add  an  additional  condition  on  a1  .  If  (2.1),  the  formula 
for  pfc+i,  is  pre- multiplied  by  gk+\  we  find  that 

gl+iPk+i  —  —  gl+igk+i  -f  &$r+iP*-  (2-7) 

If  an  exact  linear  search  is  made,  g* is  xero  and  the  direction  pk  + 1  must  be  a 
descent  direction  with  respect  to  ja+i  since  g tPk+i  is  just  the  negative  quantity 
— 9/k*  |9k  +  i-  However,  if  an  arbitrary  inexact  linear  search  is  made,  the  point  x*+i 
may  be  such  that  0kgl^.lpk  is  positive  and  larger  than  —  Consequently, 

there  is  the  possibility  that  p*4i  will  not  be  a  descent  direction  since  gk 4  | Pa-t- 1 
may  be  positive  or  xero.  A  typical  remedy  for  such  an  eventuality  is  to  restart  the 
algorithm  with  pk  4-1  **  the  steepest-descent  direction.  Consequently,  a  "crude" 
step-length  may  c^use  the  efficiency  of  the  algorithm  to  be  severely  impaired  by 
the  use  of  a  large  number  of  steepest-descent  iterations. 

This  problem  is  easily  overcome  if  algorithm  QNSL  is  used  with  an  additional 
condition  on  the  termination  of  the  sequence  {cr,J.  The  following  algorithm  is  based 
on  a  useful  property  of  conjugate-gradient  methods  whereby  the  initial  directional 
derivative  for  the  next  iteration  can  be  computed  relatively  cheaply  during  the 
computation  of  the  sequence  {a*}. 

Step-length  algorithm  TCGSL 

Let  a  be  1  small  positive  scalar  and  Pk+\  «nd  0k  denote  the  quantities 
Sfc+i’  PM- 1  *nd  0k  respectively,  computed  at  the  point  Xk~f'°<Pk-  Compute  a 1  as 
in  algorithm  QNSL  but  with  the  additional  condition  that 

— tl+iPk+x  >®ll5a+,llallPa-MllJ-  («*) 
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If  this  condition  is  not  satisfied  we  proceed  with  computing  the  sequence  {o'}  until 
it  is  satisfied  or  the  minimum  along  pk  is  found,  g 

Note  that  pk+ ,  need  not  be  computed  explicitly,  since  <?£+ jPk+i  can  be  com* 

puted  by  using  9k+i9k+i»  ~&k  &n<^  fl+lP*-  ^  ^rst  va'ue  °f  a<  satisfies  the 
condition  (c3)  then  there  is  no  extra  cost  involved  in  this  test  since  these  quantities 
are  used  elsewhere  in  the  algorithm. 

A  desirable  feature  for  any  step-length  algorithm  is  the  facility  for  imposing  an 
upper  bound  \  upon  the  step  length;  that  is,  the  final  ok  must  satisfy  ak||pA||2  <  h. 
This  bound  may  be  used  in  several  ways:  to  prevent  overflow  in  the  user-6upplied 
function  routines  (see  Section  7,  for  example);  to  increase  efficiency  by  ensuring 
that  F(i)  is  evaluated  only  at  sensible  values  of  x;  to  prevent  the  step-length 
algorithm  from  returning  an  inordinately  large  step  because  no  smaller  step  would 
satisfy  the  convergence  criterion;  to  guarantee  that  the  new  point  xk+i  remains 
feasible  when  the  step-length  routine  is  being  used  for  constrained  minimization. 
It  is  clear  that  the  convergence  of  an  algorithm  that  relies  upon  an  exact  linear 
search  in  the  computation  of  ak  will  be  impeded  if  the  distance  to  the  bound  is  less 
than  the  step  to  the  minimum.  For  the  traditional  conjugate-gradient  method, 
the  implications  are  more  serious  because  the  computation  of  the  restricted  step 
may  prevent  pk+i  from  being  a  descent  direction.  From  (2.7)  it  is  clear  that  if 
i*  positive  then  gl^  jPk+i  may  also  be  positive.  If  a*  is  restricted  to  be 
less  than  some  value  h*  (say),  there  may  be  no  acceptable  value  of  a  for  which 
siT+jPk+l  >«  negative. 

Another  disadvantage  of  the  traditional  conjugate-gradient  method  with  in¬ 
exact  linear  searches  is  that  the  algorithm  will  be  efficient  only  for  problems  for 
which  the  gradient  vector  can  be  computed  with  approximately  the  same  cost  or 
less  as  the  objective  function.  If  g(x)  is  expensive  to  compute,  then  the  sequence 
{a*}  should  be  computed  by  a  method  utilizing  function  values  only;  for  example, 
if  F(s)  is  smooth,  the  sequence  may  be  computed  by  means  of  safeguarded  quad¬ 
ratic  interpolation.  Under  these  circumstances  condition  (cl)  should  be  replaced 
by 

F(xk  +  i/pO  —  F(xk  -f  o'pk)  <  — rj(i/  —  al)glpk, 

where  v  is  any  estimate  of  the  optimal  a  such  that  u  <  o*  (see  Gill  and  Murray, 
1974c).  Unfortunately,  we  cannot  avoid  the  computation  of  the  gradient  at  the 
point  x*  -f-  o*p k,  as  it  is  required  to  check  that  condition  (c3)  holds.  This  implies 
that  a  relatively  accurate  linear  search  must  always  be  made  within  a  conjugate- 
gradient  method  that  has  been  adapted  to  accept  finite-difference  approximations 
to  the  derivatives.  This  will  impair  the  efficiency  of  the  algorithm  on  those  problems 
for  which  the  coat  of  a  function  evaluation  dominates  the  cost  of  an  iteration. 


It  it  a  property  of  conjugate-gradient  methods  that  a*  often  varies  enormously 
in  magnitude,  even  when  the  relative  change  in  the  objective  function  is  very  small. 
This  implies  that  a  sensible  choice  of  a°  is  crucial  if  the  algorithm  is  to  be  efficient. 
In  general,  the  direction  of  search  rarely  approximates  the  Newton  direction  and 
consequently  the  unit  step  is  a  poor  choice  for  a°.  (It  is  precisely  because  of  this 
that  the  user  should  be  encouraged  to  place  reasonable  bounds  upon  the  variables 
whenever  possible.  The  use  of  bounds  in  this  way  iB  likely  to  prove  even  more 
effective  for  conjugate-gradient  methods  than  it  is  for  quasi-Newton  methods.) 

A  choice  of  initial  step  length  that  we  have  found  most  successful  is  one 
suggested  by  Davidon  (1959): 

„•  _  /- -  F..,)hln.  if  -2(f.  -  <  l;  ,,  g, 

\l.  if-2( 1  ' 

where  Ftti  is  a  user-specified  estimate  of  the  function  value  at  the  solution.  (It 
may  appear  unreasonable  to  expect  the  user  to  provide  such  an  estimate,  but  in 
our  experience,  it  is  rare  for  a  user  to  have  absolutely  no  idea  of  the  value  of  the 
function  at  the  solution.  If  F„ i  is  not  specified,  the  software  has  the  facility  for 
always  choosing  the  unit  step  length  for  a°.)  In  many  situations  the  use  of  (2.8) 
was  essential  in  order  to  avoid  overflow  during  the  computation  of  the  objective 
function  during  the  first  iteration.  A  unit  step  along  the  steepest-descent  direction 
will  often  compute  the  function  at  very  large  values  of  i  if  the  function  is  badly 
scaled. 

Alternatives  tried  were 


„o  _  2 (ft-, -Fa) 

a  T  i 

9lPk 

and  a  value  of  a0  which  was  increased  or  decreased  from  that  of  the  previous 
iteration  according  to  the  value  of  o*_j.  However,  since  neither  of  these  estimates 
was  particularly  more  effective  than  (2.8)  and  the  value  of  Ft,t  was  required  to 
prevent  the  possibility  of  overflow  during  the  first  iteration,  (2.8)  was  used  in  all 
the  algorithms  compared  in  Section  7. 

2.3  Beale's  method 

Although  the  function  is  guaranteed  to  decrease  along  the  steepest-descent 
restart  direction,  the  actual  reduction  is  often  poor  compared  with  the  reduction 
that  would  have  occurred  if  restarting  had  not  taken  place.  It  would  seem  useful, 
therefore,  if  a  cycle  of  n  iterations  could  commence  with  the  last  direction  of  the 
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previous  cycle.  However,  if  an  arbitrary  vector  it  used  as  the  initial  direction  in 
the  computation  of  the  sequence  (2.1),  then  finite  termination  will  not  occur  on 
quadratic  functions  because  the  vectors  {p*}  are  not  mutually  conjugate.  Beale 
(1972)  has  shown  that,  given  an  arbitrary  initial  direction  po,  the  sequence  of 
vectors 

Pk+i  “  — 1  “f  +  IkfK, 


where  &  —  vlaM-l/vIP*  *nd  7*  “  Vo9k+i/VoPo>  »re  mutually  conjugate.  The 
simple  extension  of  this  formula  to  nonlinear  problems  involves  computing  a  cycle 
of  n  directions 

Pk+1  “  —  Sk+i  +  PkPk  +  IkPi,  (2-9) 


where  0k  —  ylgk+i/vkPk  and  7*  —  yjgk+i/vjpt ■  The  direction  pt  is  known  as 
the  restart  direction  and  is  the  last  direction  of  the  previous  cycle  along  which  a 
linear  search  was  made.  Better  performance  will  be  obtained  if  the  coefficients  0k 
and  74  are  computed  as  the  solutioo  of  the  linear  system 


\yl  Pk  ylpiJK'ikJ  Vv‘*+1/ 


(2.10) 


This  ensures  that  p*.fi  is  orthogonal  to  both  yk  and  yi  even  when  F(x)  is  not 
quadratic.  These  equations  are  solved  by  a  single  bark  substitution  since  yjpk  •* 
fixed  at  xero  from  the  choice  of  0k— \  and  7k_|.  At  the  fc-th  iteration,  (2.9)  and 
(2.10)  are  used  to  compute  p*+i  unless  better  progress  can  be  made  by  using  (2.1). 
If  Beale’s  formula  is  considered  unsatisfactory,  a  new  cycle  commences  with  p*  as 
the  restart  direction  and  with  p*+i  computed  from  (2.1).  Note  that  if  (2.9)  is  used 
in  a  nonlinear  conjugate-gradient  algorithm,  p*+ 1  may  not  be  a  descent  direction, 
even  if  an  exact  linear  search  is  made 

Powell  (1977)  has  suggested  a  condition  for  restarting  based  upon  the  property 
that  the  gradient  vectors  are  mutually  orthogonal  in  the  minimization  of  a  quad¬ 
ratic  function.  If  the  gradient  vectors  are  not  sufficiently  orthogonal,  then  a  restart 
is  made.  Specifically,  a  restart  will  take  place  if 


(2.11) 


or  there  have  been  n  linear  searches  in  this  particular  cycle.  A  restart  will  take 
place  also  if  pk+i  is  not  sufficiently  downhill,  an  adequate  downhill  direction  being 
one  that  satisfies  the  ic  •'qualities 


-1.2||*+illa  <  <  -0-8||*+,||’. 


Apart  from  the  first  direction,  the  steepest-descent  direction  is  very  rarely  used; 
it  must  be  used,  however,  if  an  upper  bound  upon  the  step  length  prevents  the 
traditional  conjugate-gradient  direction  from  being  a  descent  direction. 

Beale's  algorithm  may  be  generalised  to  accept  inexact  linear  searches  by 
replacing  the  direction  in  condition  (c3)  by  the  direction  that  would  be  used 
if  a  restart  were  made  in  the  (Ifc-f-  l)-th  iteration.  Thus,  although  a  linear  search  is 
made  along  a  direction  generated  by  (2.9)  the  condition  (c3)  is  checked  by  means 
of  a  direction  computed  from  (2.1).  It  is  necessary  to  alter  the  test  for  p*+i  being 
sufficiently  downhill  to 

— 0I+1PM-J  >°IIPk+illjll9k-Mllj-  (212) 

The  reasons  for  this  will  become  clearer  in  Section  5  when  we  discuss  convergence 
of  the  algorithm. 

3.  Limited- memory  quasi-Newton  methods 

Although  we  have  shown  how  to  implement  the  conjugate-gradient  method 
with  an  inexact  linear  search,  the  resulting  algorithm  may  not  always  behave 
exactly  as  we  should  like  when  bounds  are  present  or  the  gradient  is  expensive  to 
compute.  For  this  reason  we  seek  methods  which  give  a  descent  direction  under 
much  milder  restrictions  upon  the  step  length.  In  this  section  we  consider  a  class 
of  such  methods. 

First  we  need  some  background  theory  on  quasi-Newton  methods.  A  quasi- 
Newton  method  computes  the  direction  of  search  as  the  solution  of  the  set  of 
equations 

Bk+\Pk+i  “■  —*+»*  (3-1) 

where  fis-H  *•  *n  approximation  to  the  Hessian  matrix.  At  each  iteration  the  ap¬ 
proximate  Hessian  is  updated  by  a  matrix  of  rank  two.  The  most  popular  updating 
formula  is  the  BFXSS  formula, 

B*+l  ~  a  _  ?[k*BM‘TtBk  +  (3  2) 

where  yt  —  ys-f  i  —  9k  and  **  -»  —  x*  »  OkPk-  (The  reader  should  note  that 

the  notation  used  here  differs  slightly  from  that  of  the  general  literature  on  quasi- 
Newton  methods.  It  is  customary  to  order  the  fc-th  iteration  of  a  quasi-Newtop 
method  so  that  pk  is  computed  first,  with  the  approximate  Hessian  being  updated 
last  of  all.  In  order  to  make  the  notation  consistent  with  that  of  conjugate-gradient 
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metbodi  our  quasi-Newton  scheme  compute*  the  direction  for  the  next  iteration 
after  updating  the  approximate  Hessian.) 

There  are  many  alternative  updating  formulae,  but  they  all  generally  satisfy 
the  quasi-Newton  condition: 

Bk+  iH  —  Mk- 

In  this  paper  we  shall  mainly  consider  updating  formulae  from  the  Broyden 
one- parameter  family 

Bf+t  — Bt - - — +  -£-VkvI  +  M'lBfiJwkvl, 


where 


-f-Vk — rrr~^*ki 

w*  ijBfik 


and  d*  is  a  scalar  function  of  yt  and  £f£s*  (see  Broyden,  1970).  For  all  positive  d*, 
it  can  be  shown  that  Bf+i  will  be  positive  definite  if  Bf  is  positive  definite  and 
yJ'ij  ii  positive.  The  positive  definiteness  of  f?^|  ensures  that  Pk+i  i6  a  descent 
direction. 

The  quasi-Newton  direction  can  also  be  computed  by  updating  the  inverse  of 
the  approximate  Hessian,  and  computing 

P»+i  —  —  /4+ifc+i-  (3-3) 

For  example,  the  updating  formula  for  the  approximate  inverse  which  corresponds 
to  the  BFGS  formula  for  the  approximate  Hessian  matrix  itself  is  given  by 

Hk-k-i  —  H*  —  4"  •*1/*^*) 

v**+^<i+!Sf^  ,3-4) 

For  each  update  of  Bf  I or  the  approximate  Hessian,  there  is  a  corresponding 
update  for  the  approximate  inverse  Hessian.  In  the  following  discussion  we  shall 
need  to  refer  to  different  updating  formulae  without  explicitly  displaying  them. 
Let 

denote  the  updating  formula  used  for  a  particular  value  of  d- 

Perry  (1977)  and  Shanno  (1978a)  have  considered  methods  based  upon  com¬ 
puting  p*+j  as  —Ha+i9k+\  where  /4+j  is  a  matrix  obtained  by  updating  the 
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identity  matrix  with  a  limited  number  of  quasi-Newton  corrections.  The  storage 
of  an  n  X  n  matrix  is  avoided  by  storing  only  the  vectors  that  define  the  rank- 
two  corrections,  and  consequently  Shanno  calls  such  methods  “memory less  quasi- 
Newton  methods".  Since  the  methods  require  non-negligible  storage,  which  may  be 
substantial  on  some  occasions,  we  prefer  the  term  limited- memory  quasi-Newton 
methods.  The  precise  method  depends  upon  the  number  of  updating  vectors  stored 
and  the  quaai-Newion  updating  formula  used.  For  example,  the  direction  obtained 
with  the  “one-step"  limited-memory  BFGS  update  is  given  by  (3.3),  using  (3.4) 
with  Hjt  equal  to  the  identity  matrix,  vis 


PM- 1 


— SM- 1  +  j^(*Zsh-Hl*  +  vIsmi**) 


•IWl 

vU 


(1  + 


vIjA 

vi*k 


)**. 


(3.5) 


Although  the  direction  of  search  is  computed  as  the  product  of  a  matrix  and  a 
vector,  the  matrix  /&+i  is  never  computed  explicitly. 

It  may  be  verified  by  inspection  of  (3.5)  that  if  the  one-step  limited-memory 
BFGS  formula  is  applied  with  an  exact  linear  search  then  p*  +  i  is  identical  to  the 
direction  obtained  from  the  conjugate-gradient  method  (all  vectors  multiplied  by 
•t9k+ 1  vanish).  Hence  it  is  possible  for  the  limited-memory  quasi-Newton  method 
to  generate  mutually  conjugate  directions,  but  only  if  an  exact  linear  search  is 
made.  This  is  demonstrated  as  follows:  if  the  quasi-Newton  condition  is  written 
in  terms  of  the  inverse  Hessian  we  have  **  — *  %+ita;  consequently 


VkPk-hl  “  vJW-H 

T 

-«;*+!• 

Thus  orthogonality  between  ja  and  p*+t  will  occur  only  if  i**  **  *rro>  **» 
if  an  exact  linear  search  is  made. 

Before  we  discuss  other  methods  related  to  the  BFGS  formula,  we  shall  describe 
a  general  r-step  formula  (for  more  details  see  Nasareth  and  Nocedal,  1978).  For 
an  r-step  formula,  the  matrix  /&+>  is  implicitly  defined  by  r  approximate  Hessian 
matrices  C/j,  . . . ,  V,  such  that 


/&+I  “  r#(^/r»  |fc,i  S»,)> 

where  the  subscript  d  denotes  any  quasi-Newton  updating  formula,  o,  is  one  of  the 
indices  of  the  r  pairs  of  vectors  and  each  U,+i  is  related  to  its  predecessor 

by  the  rule 

t/>+i  ■*  s*,). 
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with  U\  usually  equal  to  the  identity  matrix.  The  r  pairs  of  vectors  {y,,*,}  can 
be  selected  from  any  of  those  defined  in  earlier  iterations  and  any  combination 
of  quasi-Newton  updates  can  be  used.  (Strictly,  we  should  have  a  suffix  j  on  the 
parameter  but  we  have  attempted  to  keep  the  notation  as  simple  as  possible.) 
In  this  notation  we  can  define  two  methods  based  on  the  BFGS  formula: 

The  one-step  BFGS  formul a 

At  each  iteration  define  U\  as  the  identity  matrix  and  set 

J&+1  "*  rflfcsM.  Vk>*k)-  (3  ®) 

The  two-step  BFGS  formul a 

At  each  iteration  define  fJ\  as  the  identity  matrix  and  set 

Mt  ■“  FarcsiUh /•»  7v 
Hk+i  «=■  rarcs(M»»  Mb  •ki¬ 
ll  the  vectors  {UjVo,)  and  {«*,}  are  known  for  j  =  l,...,r,  then  Hk+\9k+\ 
can  be  computed  by  the  sequence  of  linear  combinations 

MiSk+i  “ 

Mish-fi  —  k{Miyk+i. Miu»f.^i} 

•  • 

A  set  of  r  pairs  of  vectors  define  a  single  approximate  Hessian 

matrix 

Nasareth  (1979)  suggests  a  limited-memory  quasi-Newton  method  in  which 
as  many  pairs  of  vectors  {y,,*,}  are  kept  as  storage  will  allow.  Unfortunately, 
storing  only  y,  and  *,  substantially  increases  the  cost  of  computing  the  direction  of 
search  since  an  additional  r(r  +  1  )/2  linear  combinations  must  be  used  to  compute 
the  vectors  UjVo,  This  additional  cost  is  negligible  for  small  values  of  r,  but  care 
must  be  taken  lest  the  work  required  to  compute  the  direction  of  search  should 
dominate  the  computing  cost  of  a  single  iteration. 

The  justification  for  using  limited-memory  quasi-Newton  formulae  is  that  p*+ 1 
is  guaranteed  to  be  a  descent  direction  if  all  the  inner  products  yjs,  are  positive 
for  all  vectors  y,  and  a,  used  in  the  updating  formulae.  These  inner  products  will 
nearly  always  be  positive  if  the  step-length  is  computed  by  means  of  algorithm 


QNSL.  Suppose  that  conditions  (cl)  and  (c2)  are  satisfied  with  r  equal  to  zero. 
Condition  (cl)  implies  that 

0  <  (s(*fc  +  °kPk)  —  19k)TPk  <  VkPk, 

which  implies  that  y*s*  is  positive  since  a*  is  positive.  There  it  a  very  slight  chance 
that  Pk+i  will  not  be  a  descent  direction  if  the  enforcement  of  condition  (c2)  causes 
yjCsfc  to  be  negative.  However,  this  must  imply  that,  for  a  step  length  smaller  than 
a\  the  directional  derivative  at  j*+j  must  be  larger  than  at  x*.  The  function  F(x) 
would  need  to  be  highly  irregular  for  this  to  occur.  If  p*+ 1  is  not  a  descent  direction 
and  a  full  quasi-Newton  method  is  being  used,  then  the  approximate  Hessian  is 
not  updated  during  that  iteration.  For  a  one-step  limited-memory  quasi-Newton 
method,  the  same  strategy  suggests  taking  a  steepest-descent  step.  For  a  multi- 
step  formula,  some  of  the  updates  will  need  to  be  ignored. 

Restarts  can  be  incorporated  into  the  limited-memory  quasi-Newton  scheme 
by  performing  a  two-step  update  using  **,  y*  and  information  from  a  restart:  s( 
and  y*.  Shanno  (1978a)  has  suggested  the  following  algorithm: 

Uj  -»  ,-Rv 

H*+ 1  "=  rBFCs{Ui, 
where  0  denotes  the  self-scaled  BFGS  formula 


—  iHk  —  +  *kV  kHk) 

1  /t  ,  vlHk yk,  T 


with  7  vlnlvlHkVk.  (See  Oren,  1974  and  Oren  and  Spedicato,  1S76,  for  a 
discussion  of  the  self-scaled  BFGS  formula.)  This  algorithm  is  one  of  those  tested 
in  Section  7. 


4.  Preconditioned  eonjugsts-gradiant  methods 

Preconditioning  has  been  used  for  some  time  to  aid  the  solution  of  linear  equa¬ 
tions  by  conjugate-gradient  methods.  The  purpose  of  preconditioning  is  to  alter  the 
coefficient  matrix  of  the  problem  so  that  its  condition  number  is  reduced.  It  is  hoped 
that  this  action  will  reduce  the  number  of  ’terations  required.  Preconditioning 
can  also  be  regarded  as  a  technique  which  capitalises  on  any  structure  occurring 
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ill  the  coefficient  matrix,  and  it  is  from  this  viewpoint  that  we  shall  discuss  the 
application  of  preconditioning  to  nonlinear  problems. 

Suppose  we  are  minimising  a  quadratic  function  cTx  -f  ^xTQx  where  the 
Hessian  matrix  Q  can  be  written  in  the  form 

Q-M-N, 

with  M  a  positive-definite  symmetric  matrix  which  is  easy  to  invert  (for  example, 
M  might  be  diagonal).  The  procedure  for  computing  the  direction  of  search  may 
be  generalised  as  follows  (see  Concus  et  si.,  1976).  Solve  the  linear  equations 

M*k+ 1  ■“  —  0k+» 


and  set 

Pt+l  -•  «fc-H  +  ftpfc,  (4-1) 

where  fin  —  — l/[*k+i/y[pk •  As  in  the  traditional  conjugate-gradient  algorithm, 
the  directions  {p*}  are  mutually  conjugate  with  respect  to  the  matrix  Q,  but  the 
algorithm  has  the  additional  property  that  the  vectors  {**}  are  conjugate  with 
respect  to  the  matrix  M. 

This  algorithm  may  be  extended  to  nonlinear  problems  by  using  a  matrix  A& 
that  varies  from  iteration  to  iteration;  for  example,  Naxareth  and  Nocedai  (1978) 
suggest  using  an  r-step  limited-memory  approximate  Hessian  matrix  for  If 

the  Hessian  matrix  is  sparse,  we  can  form  a  complete  approximation  to  the  Hessian 
matrix  by  using  finite  differences  (Gill  and  Murray,  1973;  Curtis  et  al.,  1974),  or  a 
sparse  analogue  of  any  of  the  well  known  updating  formulae  (Toint,  1977;  Shanno, 
1978b).  If  we  denote  the  approximate  Hessian  as  Bk+\  the  iteration  requires  the 
solution  of  the  sparse  equations 

#a+iAH-i  "*  (4-2) 

One  of  the  most  effective  methods  for  solving  the  sparse  positive-definite  equations 
Ax  ■■  6  is  to  form  the  LDLT  factorisation 

A  -  LDLr, 

where  L  is  a  unit-lower  triangular  matrix  and  D  is  a  diagonal  matrix.  The  vector 
x  may  then  be  found  by  successively  solving  the  triangular  systems 

Lv  —  h  and  Lrx  ™  D~lv. 
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Unfortunately,  whatever  the  method  used  to  approximate  the  sparse  Hessian, 
23fe+i  can  not  be  guaranteed  to  be  positive  definite.  If  indefiniteness  occurs,  not 
only  ia  the  Cholesky  algorithm  numerically  unstable  but  also  pk+\  may  not  be 
a  descent  direction.  We  can  remove  both  these  difficulties  by  using  the  modified 
LDLt  factorisation  of  the  matrix  j.  The  modified  LDLr  factorization  of  a 
matrix  A  is  such  that 

A  =*  LDLr  —  E, 

where  L  and  D  are  defined  as  before  and  E  is  a  diagonal  matrix  which  is  zero  if 
A  is  positive  definite  (see  Gill  and  Murray,  1974a). 

If  there  is  little  fill-in  during  the  factorisation  it  is  not  worthwhile  computing 
the  direction  p*+i  since  4+1  is  a  perfectly  adequate  direction  of  descent  (compare 
Equations  3.1  and  4.2).  However,  there  are  some  problems  for  which  the  amount  of 
fill-in  is  too  large  for  the  factors  to  be  held  in  core.  In  these  situations  the  modified 
LDLt  factorization  can  be  computed  so  that  any  unwanted  fill-in  is  ignored.  In 
this  case 

A~LDLt- S, 

where  the  matrix  £  is  not  a  diagonal  matrix.  Alternatively,  if  the  Hessian  matrix  has 
some  specific  structure,  such  as  a  band  structure,  but  there  are  some  “complicating 
elements’*  which  cause  significant  fill-in  during  the  factorization  process,  these 
elements  can  be  ignored.  Techniques  of  this  kind  have  been  used  successfully  to 
precondition  linear  systems  (see  Meijerink  and  Van  der  Vorst,  1977;  Munksgaard, 
1979). 

Techniques  based  upon  matrix  factorizations  are  specifically  designed  for 
sparse  systems.  Since  we  are  primarily  concerned  with  problems  whose  Hessian 
matrix  cannot  be  stored,  we  shall  not  consider  these  methods  further  in  this  paper. 

One  technique  that  can  be  applied  to  general  problems  is  a  preconditioning 
based  upon  a  diagonal  scaling.  If  the  direction  of  search  is  obtained  from  Equation 
(3.1)  the  quasi-Newton  formulae  from  the  Broyden  class  may  be  simplified  so  that 
the  matrix  Bk  does  not  appear  in  the  rank  two  correction;  for  example,  the  BFGS 
formula  becomes 

s,+l"fl,+ + (°> 

This  result  implies  that  even  if  the  ofT-diagonal  elements  of  Bk  are  unknown,  the 
diagonal  elements  can  still  be  recurred.  These  diagonal  elements  may  be  used  to 
precondition  the  conjugate-gradient  method.  Let  and  denote  the  j-th  elements 
of  gk  and  y*  respectively.  If  A*+i  -■  diag(JI( . . .  ,J„)  and  A*  —  diag(6i, . . .  ,6n) 
denote  the  approximate  diagonal  Hessians  during  the  (fc-f*  l)-th  and  fc-th  iterations 
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respectively,  then 

,'"*+aK* 

(4.4) 

and 

(4.5) 

The  BFGS  recurrence  of  A*  has  some  theoretical  justification  since,  in  the 
quadratic  case  with  exact  linear  searches,  the  search  directions  generated  by  the 
two  algorithms  are  identical  (see  Nazareth,  1979).  The  motivation  for  using  this 
algorithm  on  general  nonlinear  functions  is  to  scale  the  search  directions  so  that 
the  initial  step  along  pu  will  be  a  better  prediction  of  the  minimum.  We  shall 
demonstrate  in  Section  7  that  this  is  indeed  the  case. 

To  ensure  that  the  elements  of  A*+i  are  positive,  a  diagonal  is  not  updated  if 
the  result  will  be  negative.  An  algorithm  must  also  be  used  to  prevent  the  condition 
number  of  A*+i  from  becoming  excessive.  If  fl  denotes  a  preset  bound  on  the 
condition  number  of  j  and  k  >  17  where  a  —  $»**•*/$  m»».  then  we  use  6j  for 
the  new  diagonals,  where  w  ™  log  17/  log*.  The  value  of  the  bound  k  is  the  same 
as  that  used  by  Gill  et  ai.  (1972b),  viz 

f7  -  l/(100n*e), 

where  c  is  the  relative  machine  precision.  It  should  be  noted  that  this  bound  on 
the  condition  number  was  never  achieved  in  any  of  the  computer  runs  discussed 
in  Section  7. 

The  traditional  conjugate-gradient  algorithm  and  Beale's  algorithm  precondi¬ 
tioned  by  a  diagonal  matrix  still  suffer  from  the  disadvantage  that  the  rate  of  con¬ 
vergence  may  deteriorate  if  there  is  a  bound  upon  the  step  length.  Fortunately,  the 
limited-memory  quasi-Newton  methods  can  also  be  generalised  to  accept  diagonal 
preconditioning.  In  the  r-step  formula,  the  sequence  of  r  approximate  Hessian 

matrices  U\ . Ur  is  started  with  U\  equal  to  a  diagonal  preconditioning  matrix 

rather  than  the  identity  matrix.  For  example,  the  preconditioned  two-step  BFGS 
algorithm  is  given  by: 

The  diagonally  preconditioned  two-step  BFGS  formula 

At  each  iteration  define  L/j  as  the  diagonal  matrix  A^+,  and  set 
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5.  Some  comment*  on  convergence  proof* 

It  is  important  to  realise  that  all  proofs  concerning  the  convergence  of  conjugate- 
gradient  algorithm*  are  theoretical  in  the  sense  that  they  ignore  the  efTccts  of 
rounding  error.  Rounding  error  may  upset  some  of  the  most  fundamental  theory 
of  conjugate-gradient  algorithms,  as  the  following  example  will  illustrate. 

Consider  the  quadratic  function 


*M- (i-*/. 

>-> 

where  d,  —  ( j/n)p ,  p  >  0.  For  all  values  of  n  and  p  this  function  has  the  solution 
x*  ■■  (I, . . . ,  l)r,  at  which  point  F(x)  is  xero.  The  Hessian  matrix  of  this  function 
is  of  the  form  Q  -■  diag((l/n)p,  (2/n)p, ....  I)  and  has  condition  number  np.  For 
p  3  and  n  *=  50  the  condition  number  of  Q  is  1.25  X  105.  This  case  was  run  on 
an  IBM  370/168  using  double- precision  arithmetic,  the  relative  machine  precision 
being  approximately  1.0  X  10~15.  In  theory,  the  conjugate-gradient  algorithm 
should  give  zero  values  of  F(x)  and  ||y(x)||  at  iteration  n.  However,  the  function 
and  gradient  values  were  9.8  X  10~5  and  9.8  X  10~4  respectively.  At  iteration  2r» 
the  function  value  and  gradient  norm  were  1.3  X  10~*  and  8.0  X  10“ 5.  Clearly 
the  property  of  n-step  termination  does  not  apply  in  this  case,  yet  the  condition 
number  of  Q  is  not  excessive  in  relation  to  the  precision  of  the  arithmetic  being 
used. 

It  is  illuminating  to  compare  this  result  with  that  obtained  by  another  method 
which  theoretically  should  compute  exactly  the  same  numbers  when  minimizing  a 
quadratic  function.  Nasareth  (1979)  has  shown  that  on  a  quadratic  function  with 
exact  linear  searches,  the  search  directions  generated  by  the  BFGS  method  are 
identical  to  those  generated  by  the  traditional  conjugate-gradient  method.  If  we 
use  the  BFGS  formula  to  minimize  the  quadratic  function  (2.5)  we  find  that  at  the 
50th  iteration  the  function  value  is  2.1  X  10~~*  and  the  norm  of  the  gradient  vector 
is  8.8  X  10  *.  However,  at  iteration  55  the  method  recovers  and  the  function  value 
is  essentially  zero  at  4.1  X  10~J*  with  gradient  norm  1.5  X  10— ,4. 

These  results  indicate  that  in  practice,  even  on  a  quadratic  function,  the  se¬ 
quence  generated  by  a  conjugate-gradient  algorithm  will  be  infinite.  Consequently, 
we  must  not  be  surprised  if  behavior  predicted  by  a  theoretical  result  based  upon 
some  n-step  property  is  not  observed  in  practice.  The  most  useful  theorems  are 
those  which  provide  negative  results,  that  is,  results  which  tell  us  that  convergence 
will  not  occur,  even  if  infinite-precision  arithmetic  is  used.  For  example,  one  such 
result  was  established  by  Powell  (1976):  if  the  initial  direction  of  search  for  a 
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sequence  of  iterates  designed  to  minimize  a  quadratic  function  is  not  the  slccpest- 
descent  direction  then  the  traditional  conjugate-gradient  method  will  converge 
only  at  a  linear  rate. 

The  moat  powerful  convergence  proof  for  the  traditional  conjugate-gradient 
algorithm  with  exact  linear  searches  was  established  by  Powell  (1977).  However, 
for  Powell’s  proof  to  be  correct,  it  is  necessary  to  modify  the  statement  of  the 
theorem. 

Theorem  1  (Convergence  with  exact  linear  searches). 

Let  (xk)  and  (pk)  be  computed  from  the  formulae  (2.1)  and  (2.3)  for  all  k. 
Choose  each  ok  such  that  it  is  the  local  minimum  along  pk  which  is  the  smallest 
in  magnitude.  If  the  set  {x  |  /'(x)  <  f’(xo) }  is  bounded,  if  y(x)  is  continuous  and 
if  the  quantities  IJxs.fi  — xk||z  tend  to  zero,  then 

lim  M  0.  | 

k-*oo 

Our  specification  of  the  sequence  differs  from  that  of  Powell  in  the  require¬ 
ment  that  each  a*  be  the  local  minimum  along  p*  that  is  closest  to  xk.  This  must 
be  done  to  ensure  that  members  of  the  sequence  {ak}  satisfy  the  condition  (2.5) 
(see  Ortega  and  Rheinbolt,  1970,  pp.  483-484).  If  the  sufficient  decrease  is  not 
made,  the  proof  given  by  Powell  does  not  hold  since  (2.5)  is  needed  to  show  that 
limk_oo  9^Pk/\\Pk\\  “  0  snd  hence  that  limk_oo||yk||  *=  0.  The  unmodified  theorem 
breaks  down  under  the  following  circumstances. 

Let  be  the  minimum  of  /r(xk-}-apk)  closest  to  zero.  Suppose  that  the  initial 
member  of  the  sequence  {cr*}  lies  beyond  ok.  If  F(x)  is  a  non-unimodal  function  it 
is  possible  for  j(xk  -f-  o0pk)’rpk  to  be  negative  and  for  F  (xk  -f-  o°pk)  to  be  less  than 
Fk.  The  step-length  algorithm  will  then  proceed  to  locate  a  value  or  ok  that  is 
greater  than  ak.  When  this  occurs  the  function  may  be  reduced  by  an  arbitrarily 
small  amount  and  it  is  not  possible  to  derive  a  condition  analogous  to  (c2).  In 
practical  computation  it  is  relatively  simple  to  derive  an  acceptable  step  length 
from  any  local  minimum  of  F(xk~\~  <*Pk)  by  simply  reducing  the  step  by  multiples 
of  a  constant  less  than  unity.  Unfortunately  the  resulting  step  length  does  not 
give  the  required  orthogonality  between  gt+i  and  pk,  which  is  an  essential  feature 
of  Powell’s  analysis. 

It  must  be  emphasized  that  the  need  to  determine  the  first  minimum  along  pfc 
renders  the  algorithm  analysed  in  Theorem  1  unsuitable  for  computer  implemen¬ 
tation.  If  we  are  to  be  sure  that  ak  is  the  first  minimum  along  pk  for  general 
functions,  we  need  to  be  able  to  compute  all  the  minima  of  F(xk~\~°Pk)-  Nonetheless, 
Theorem  1  does  have  practical  relevance  for  convex  functions  or  functions  that 
are  unimodal  along  each  direction  p*. 
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We  shall  now  present  a  convergence  theorem  for  the  algorithm  with  inexact 
linear  searches,  which  is  analogous  to  Theorem  1. 

Theorem  2  (Convergence  with  inexact  linear  searches). 

Let  pi,  and  0k  he  computed  as  in  Theorem  1.  Let  ak  be  computed  by  means  of 
the  step-length  algorithm  TCGSL.  If  ok  is  computed  as  the  first  minimum  along 
pt,  only  on  those  occasions  that  the  condition  (c3)  is  not  satisfied,  then 

lim  HftH  —  0. 

fc-»oo 


Proof 

Gill  and  Murray  (1974c)  have  shown  that  the  step-length  algorithm  QNSL 
gives  an  ok  such  that  (2.5)  is  true.  Consider  the  step  lengths  generated  by  algorithm 
TCGSL.  If  a  value  of  o*  is  found  that  eventually  satisfies  (c3),  then  ak  will  satisfy 
(2.5)  since  it  can  be  regarded  as  being  computed  by  means  of  QNSL  but  with 
modified  values  of  rj  and  p.  Alternatively,  if  the  minimum  of  F(xk  +  apk)  is  found, 
then  (2.5)  is  seen  to  be  true  from  the  analysis  presented  by  Ortega  and  Rhcinbolt 
(1970,  pp.  484-486). 

Assume  that  the  theorem  is  not  true  and  that  a  limit  point  is  obtained  which 
is  not  a  stationary  point  of  F(x).  We  can  choose  an  integer  m  and  a  small  scalar 
c  such  that  for  all  Ac  >  m, 

ll»ll>£.  (5-1) 

a 


Since  the  quantities  Hx^+i  —  x*||  tend  to  xero  we  must  have  limfc_ooP*— P*+i  “  0. 
which  implies  from  (2.6)  that  for  m  sufficiently  large, 


<e. 


Thus  from  (5.1), 

T 

-- *±*-<o 

IlftllM 

for  all  Ac  >  m.  However  this  implies  that  a*  will  be  computed  with  an  exact 
linear  search  and  the  algorithm  will  behave  precisely  as  the  conjugate-gradient 
algorithm  discussed  in  Theorem  1.  Since  such  an  algorithm  is  convergent  we  have 
a  contradiction  and  the  theorem  must  be  true.  | 


McCormick  and  Pearson  (1969)  have  shown  that  for  a  wide  class  of  functions, 
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the  restarted  conjugate-gradient  method  is  r>-step  superlinearly  convergent,  i.e. 


lira 

)— *oo 


*  II 


0. 


The  proof  of  rv-step  superlinear  convergence  is  critically  dependent  upon  the 
use  of  restarting.  Powell  (1976)  has  shown  that  algorithms  that  do  not  contain  a 
restarting  strategy  almost  always  converge  linearly. 

The  reader  should  note  that  the  term  n-step  superlinear  convergence  has 
very  little  meaning  when  n  is  large  since  the  computation  of  enough  values  for 
the  asymptotic  convergence  theory  to  hold  would  be  infeasible.  In  our  view, 
any  conjugate-gradient  method  that  requires  more  than  2 n  or  3n  iterations  to 
achieve  any  meaningful  accuracy  should  be  considered  to  have  failed.  Conjugate- 
gradient  methods  are  intended  for  values  of  n  that  may  be  in  excess  of  1000  (for 
example,  in  molecular  chemistry,  problems  with  several  thousand  variables  are 
routine).  Computing  several  thousand  values  of  the  objective  function  is  likely  to 
be  prohibitively  expensive  for  all  but  the  simplest  of  problems. 

In  summary,  our  experience  is  that,  except  in  very  special  circumstances  (such 
as  when  F[z)  is  a  well-conditioned  quadratic  function),  the  conjugate-gradient 
method  is  alwsjrt  linearly  convergent,  regardless  of  whether  or  not  restarting  takes 
place. 

( 

6.  Restarting  strategies  and  extensions 

In  our  view,  the  accepted  theoretical  justiBcation  for  restarting  in  the  nonlinear 
case  needs  to  be  re-examined.  In  particular,  restarting  every  n  iterations  would 
appear  to  be  unnecessary  since  we  should  not  expect  to  perform  more  than  2n  or 
3 n  iterations  in  total.  It  is  unlikely  that  one  or  two  iterations  with  the  6tecpest- 
descent  direction  will  make  much  difference  to  the  progress  of  the  minimization. 

It  would  appear  that  the  same  argument  could  be  used  to  dismiss  the  case  for 
restarting  with  arbitrary  directions,  as  is  done  in  Beale's  algorithm.  However,  the 
results  of  Section  7  show  that,  on  average,  Beale's  algorithm  with  Powell  restarts 
generally  requires  fewer  iterations  and  function  evaluations  than  the  traditional 
conjugate-gradient  method.  Initially  we  found  this  puzzling  since  it  seemed  unlikely 
that  a  direction,  kept  for  what  could  be  a  cycle  of  thirty  or  forty  iterations,  could 
possibly  provide  any  useful  information.  However,  this  question  was  answered 
when  the  runs  were  closely  investigated.  On  many  of  the  problems  a  restart  was 
being  made  every  one  or  two  iterations.  It  was  extremely  rare  for  a  direction  to 
be  used  for  more  than  ten  iterations.  Clearly  in  these  cases,  the  term  “restarting'' 
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is  misleading.  The  fact  that  Powell's  restart  criterion  causes  frequent  restarts  is 
not  surprising,  since  a  criterion  based  upon  the  orthogonality  of  the  gradient  vec¬ 
tors  will  have  no  meaning  when  inexact  linear  searches  arc  made  (Powell's  paper 
considered  the  conjugate-gradient  method  with  exact  linear  searches). 

We  believe  that  the  major  contributing  factor  to  the  improvements  obtained 
by  restarting  is  the  additional  information  that  is  being  used  during  the  computa¬ 
tion  of  the  search  direction,  namely,  the  restart  direction  pt  and  its  corresponding 
gradient  difference  y<.  In  the  quadratic  case  the  conjugate-gradient  direction  pk+ 1 
of  (2.1)  is  essentially  of  the  form 


k 

Pfc+l  —  —  0k+l  +  X] 


(61) 


with  “  0k  and  mj}  =  0  for  0  <  ;  <  It.  In  the  nonlinear  case,  all  the  u>}  will  be 
nonsero  and  Beale’s  algorithm  can  be  interpreted  as  a  normal  conjugate-gradient 
iteration  with  a  recent  direction  being  maintained  in  the  recurrence. 

However,  there  are  some  aspects  of  restarting  that  remain  inexplicable.  In  a 
numerical  experiment,  two  implementations  of  Shanno’s  method  were  tested  on 
the  examples  listed  in  Section  7.3.  The  methods  were  identical  except  that  one 
algorithm  was  restarted  every  iteration  while  the  other  was  restarted  accordiisg  to 
Powell’s  criterion.  Although  the  number  of  function  evaluations  required  was  of  the 
same  order  of  magnitude  for  both  techniques,  the  algorithm  which  was  restarted 
every  iteration  required  more  evaluations,  on  average,  than  its  competitor,  despite 
the  fact  that  the  Powell  criterion  forces  a  restart  very  frequently.  In  our  opinion 
this  behaviour  is  not  adequately  explained  by  the  currently  accepted  theory. 

It  is  clear  that  we  need  to  be  able  to  compute  a  direction  incorporating  infor¬ 
mation  from  past  iterations  without  significantly  increasing  the  storage.  We  shall 
now  describe  a  technique  that  we  have  found  to  be  quite  successful  in  practice. 
After  a  sequence  of  iterations  of  a  conjugate-gradient  method  we  could  pretend 
that  we  had  obtained  the  current  point  by  taking  just  one  step  along  the  vector 
representing  the  total  change  in  the  variables.  We  would  expect  this  vector  to  be 
a  good  direction  of  descent  since  a  substantial  decrease  in  F(x)  must  have  already 
occurred.  A  feature  common  to  all  conjugate-gradient  type  methods  is  that  even¬ 
tually  very  small  steps  are  taken  with  very  little  reduction  in  the  function  value. 
We  can  regard  the  computation  of  the  total  change  in  x  as  a  method  of  identifying 
the  "trend"  of  a  sequence  of  smaller  steps.  We  can  utilize  these  observations  by 
modifying  any  of  the  conjugate-gradient  methods  so  that  a  cycle  of  iterations  is 
made  with  the  recurrence  relation  including  a  term  of  the  form 
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(6.2) 


*s  —  £•,—  **“  **• 

where  t  is  the  index  of  the  iteration  in  which  the  current  cycle  commenced.  The 
corresponding  change  in  the  gradient  is 


Vs  —  9k  —  *  —  2  W* 
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A  new  cycle  should  start  when  the  reduction  in  F(x)  begins  to  become  small  in 
relation  to  the  total  reduction  that  has  occurred  since  the  start  of  the  current 
cycle.  One  method  of  achieving  this  objective  is  as  follows. 

Let  be  a  scalar  which  is  constant  during  the  f-th  cycle;  a  new  cycle  starts 
at  iteration  k  (i.e.  t  is  set  equal  to  k)  if 

<  w+i-n+i),  (6.4) 

where  Ft+i  is  the  value  of  the  function  on  completion  of  the  first  iteration  of  cycle 
q.  The  parameter  will  be  larger  or  smaller  than  according  to  whether  or 
not  Ft  —  Ft+i,  the  reduction  in  F(x)  during  the  last  iteration  of  cycle  q,  is  larger 
or  smaller  than  —  Ft +a,  the  reduction  obtained  during  the  first  iteration  of 

cycle  q-f- 1.  The  precise  method  for  fixing  #f+i  is  as  follows:  initially,  Aj  ■-  10  2; 
after  the  first  iteration  of  the  (q  +  l)-th  cycle  which  started  at  iteration  t: 


IK. 


if  F,— f|+j  ^  $(F|+i  — Fi+a); 
if  Ft  —  Ft+i  >  2(Fi+j  — Fj+a). 


In  Section  7  we  describe  an  implementation  of  this  scheme  with  a  preconditioned 
two-step  BFGS  formula.  The  precise  algorithm  is  as  follows: 

The  diagonally  preconditioned  two-step  BFGS  formula  with  accumulated  step 

At  each  iteration  define  t/j  as  the  diagonal  matrix  and  set 

Uj  —  r btgs{Ui ,  Vk>  **);  (o  c) 

/4+i-T  btcM,*,*)-  1  ' 
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7.  Numerical  results  and  discussion 

In  this  section  we  discuss  the  numerical  behaviour  of  several  of  the  methods 
discussed  in  earlier  sections.  It  was  not  feasible  to  test  every  technique  that  has 
been  d  scribed,  but  we  have  attempted  to  select  those  algorithms  which  are  cither 
in  common  use  or  show  the  most  promise  from  a  theoretical  point  of  view.  The 
method  used  to  compare  algorithms  consists  of  applying  them  to  a  set  of  test 
problems.  We  do  not  claim  that  this  is  a  completely  satisfactory  means  of  com¬ 
parison,  but  we  believe  that,  if  the  test  problems  are  selected  carefully,  the  evidence 
obtained  can  be  a  valuable  aid  in  the  selection  of  the  best  algorithm. 


7.1  The  assessment  criterion 

All  optimization  software  requires  a  criterion  for  terminating  the  computation 
of  the  sequence  {**}.  Ideally,  if  we  wish  to  measure  the  comparative  efficiency 
of  routines  we  should  set  the  same  termination  criterion  in  all  the  routines  tested 
and  then  compute  the  cost  of  a  minimization,  in  terms  of  the  number  of  function 
evaluations  for  instance.  However,  there  is  no  universal  agreement  on  what  is  the 
best  termination  criterion  and  a  different  criterion  used  by  another  researcher  may 
result  in  a  wide  variation  in  the  accuracy  of  the  answer  obtained.  The  question 
remains,  therefore,  as  to  the  point  at  which  we  should  assess  the  efficiency  of  the 
various  methods.  The  assessment  criterion  used  here  is  to  take  the  first  point  x* 
for  which 

F.-C(j,)<r(l  +  |C(*')|),  (7.1) 

where  r  is  a  scalar.  Some  authors  have  argued  against  the  use  of  (7.1)  because  it 
includes  F(x*),  which  is  unknown  on  real  problems.  We  believe  that  such  authors 
are  confusing  an  assessment  criterion,  where  the  use  of  F(x*)  is  legitimate,  with  a 
termination  criterion,  where  it  is  not. 

If  the  criterion  (7.1)  is  to  give  a  realistic  assessment  of  the  performance  of  an 
algorithm,  the  choice  of  r  must  give  a  point  x*  which  is  close  to  a  final  estimate 
of  x*  obtained  with  a  realistic  termination  criterion.  The  relative  performance  of 
algorithms  with  superlinear  convergence  is  almost  invariant  with  the  choice  of  r 
and  a  very  small  value  can  be  used.  For  example,  on  an  IBM  370/168,  where 
the  function  can  be  computed  to  approximately  fifteen  decimal  places  in  double 
precision,  a  reasonable  choice  of  r  is  10-,°.  However,  for  conjugate-gradient  type 
methods,  which  exhibit  a  linear  rate  of  convergence,  the  performance  can  vary 
widely  with  the  choice  of  r.  It  is  not  unusual  for  the  number  of  function  evaluations 
to  be  three  times  greater  for  r  «=  10— 10  than  for  r  iO  *.  In  this  case  it  is 
important  that  a  moderate  termination  criterion  be  used.  In  all  the  tests  carried 
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out  for  this  study,  r  wu  aet  at  10  *. 


7.2  The  algorithms  tested 

The  results  of  Section  7.5  illustrate  the  numerical  behaviour  of  eight  algorithms 
for  large-scale  unconstrained  minimisation. 

The  following  four  conjugate-gradient  algorithms  all  incorporate  the  step- 
length  algorithm  TCGSL. 

Algorithm  CG 

The  traditional  conjugate-gradient  algorithm  (see  2.1,  2.3  and  2.8). 
Algorithm  PCG 

Diagonally  preconditioned  traditional  conjugate-gradient  algorithm  (see  2.1, 
2.3,  4.1,  4.4  and  4.5). 

Algorithm  BCG 

Beale’s  algorithm  with  the  Powell  restart  criterion  (see  2.8,  2.9,  2.10,  2.11  and 

2.12). 

Algorithm  PBCG 

Algorithm  BCG  with  diagonal  preconditioning. 

The  following  four  limited-memory  quasi-Newton  algorithms  all  incorporate 
step-length  algorithm  QNSL. 

Algorithm  Shanno 

An  implementation  of  Shanno’s  algorithm  (see  3.8  and  Shanno,  1978a).  Apart 
from  the  use  of  (3.8),  we  hare  attempted  to  follow  the  description  of  Shanno’s 
algorithm  as  closely  as  possible.  However,  we  must  emphasise  that  the  results 
obtained  will  not  necessarily  agree  with  those  of  other  implementations. 

Algorithm  PLM1 

Diagonally  preconditioned  on^step  BFGS  formula  (see  3.6).  The  direction  of 
search  is  computed  as  — H*+ija+i  where  /&+ 1  is  given  by 

Hk+ 1  P BFGS^k-l  1 1  **)• 

Algorithm  PL  M2 

Diagonally  preconditioned  two-step  BFGS  formula  (see  3.7  and  4.8). 
Algorithm  PLMA 
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Diagonally  preconditioned  two-step  BFGS  formula  with  accumulated  step  (see 
6.5,  6.2,  6.3  and  6.4). 

For  comparison  purposes,  the  smaller  test  problems  were  run  on  two  algo¬ 
rithms  designed  to  solve  small  dense  problems. 

Algorithm  MNM 

A  modified  Newton  method  using  first  and  second  derivatives  (see  Gill  and 
Murray,  1974a). 

Algorithm  QNM 

A  quasi-Newton  method  using  the  full  n  X  n  BFGS  update  of  the  approximate 
Hessian  Matrix  (see  Gill  and  Murray,  1972a). 

7.3  The  test  examples 

The  provision  of  suitable  test  problems  it  extremely  difficult.  Problems  that 
are  used  to  measure  the  efficiency  of  algorithms  for  small  dense  problems  are 
completely  unsatisfactory  since  the  algorithms  considered  here  arc  intended  only 
fer  large-scale  problems.  For  example,  it  is  pointless  to  test  a  two-step  limited 
memory  quasi-Newton  method  on  a  two  dimensional  problem  since  the  algorithm 
will  be  effectively  performing  a  full  qv.ssi-Newton  iteration! 

A  serious  difficulty  with  using  <ery  large  lest  problems  is  that,  for  all  but 
the  most  trivial  examples,  the  CPU  time  necessary  to  compute  the  objective  func¬ 
tion  will  be  very  large.  This  is  typically  the  case  if  we  attempt  to  use  real-world 
problems  for  testing  purposes.  Moreover,  it  is  desirable  that  problems  be  drfined 
in  such  a  way  that  they  may  be  used  by  other  researchers.  Large-scale  real-world 
problems  almost  invariably  are  written  in  a  non-portable  form  or  can  be  run  only 
with  vast  quantities  of  numerical  data. 

In  this  study  we  have  attempted  to  compromise  on  these  issues  by  collecting  a 
set  of  non-trivial  problems  that  can  be  run  with  moderate  ease  at  other  installations. 
A  total  of  25  problems  are  considered.  Of  these,  20  problems  are  of  dimension  50  or 
greater  and  9  problems  are  of  dimension  100.  For  the  purposes  of  the  comparison, 
each  algorithm  being  tested  was  run  a  total  of  93  times.  It  is  necessary  to  present 
an  extensive  number  of  results  because,  as  we  shall  demonstrate  in  Section  7.5, 
the  performance  of  conjugate-gradient  methods  is  generally  erratic.  If  we  arc  to 
identify  which  strategy  gives  a  true  improvement  in  performance,  a  wide  spectrum 
of  results  must  be  considered. 

The  test  examples  may  be  separated  into  two  classes.  The  first  class  con¬ 
tains  problems  whose  Hessian  matrix  at  the  solution  has  clustered  eigenvalues; 
the  second  contains  problems  whose  Hessian  matrix  has  an  arbitrary  eigenvalue 
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distribution. 


Example  1.  Penl  (Gill  el  »!.,  1972b) 


F(l)  -  a  £> -  1)*  +  *(£ A  -  })’• 


The  solution  varies  with  n,  but  s%  «  x»+i,  •  “  1#  •••>»» —  1*  All  the  runs  made 
were  with  o  —  1,  6  —  10-3.  With  these  values,  the  Hessian  matrix  at  the  solution 
has  n  —  1  eigenvalues  0(1)  and  one  eigenvalue  0(10  3).  The  Hessian  matrix  is 
full  and  consequently,  for  large  values  of  n,  conjugate-gradient  type  methods  are 
the  only  techniques  available. 

Example  2.  Pen2  (Gill  el  ai.,  1972b) 


rw  -  •  E(('*‘/1’ + -  *>)’ + (**‘/“  -  '~1/,,),) 

+ *((]£<"  -•+*)*?-  *)*+(*■  -  g)*)*. 


where  c<  —  e*/10  +  e*'”,,/,#  for  —  2 . n.  The  solution  varies  with  n,  but 

Xi  —  *+i  for  »  —  1 . n  —  1.  This  example  was  also  run  with  a  —  1  and  b  — 

10~3.  For  these  values  the  Hessian  matrix  at  the  solution  has  n  —  2  eigenvalues 
0(1)  and  two  eigenvalues  0(10“*).  The  Hessian  matrix  is  full. 


Example  3.  Pen3 


F(x)  —  a< 


n — 3 


!+•■’£(*+ 2**+»  + i°*+»  -  O’ 


•—I 


+ (E  (* + *■<+» + 10i-+>  -  i)’)(Xj  (j* + 

n — 2 

+ «*— E  (Ji‘ + ‘'+1  - 3)’ 


*»  +  l 


-3)’) 


1—1 


(Erf-  n))  +^(*.-1)a- 

1  '  t-1 


At  the  minimum,  this  function  hat  n/2  eigenvalue*  0(1)  and  n/2  eigenvalue* 
0(1 0~3).  The  Hessian  matrix  is  full. 

Example  4.  PSP  (Toint,  1978) 

50  1) 

Hi)  -  5)’  +  E  AM*)' 

»—  1  1  —  1 


where 

U.  —  4  —  2  *i  +  ^  x>' 
>6-4(0  >€®(0 


1  /ilk. 

100(0.1  —  w.)-h  10, 


V.>0.1, 

V,<0.1, 


and  the  constants  a„  A  »nd  d,  are  given  by  Toint  (1978)  together  with  the  sets 
of  indices  A(»)  and  B(i)  which  are  subsets  of  the  indices  {1,2, 3,... ,50).  This 
example  has  a  sparse  Hessian  matrix. 

The  remaining  examples  have  arbitrary  distributions  of  eigenvalues  at  the 
solution. 


Example  5.  Chebyquad  (Fletcher,  1965) 

30 


where 


FW- £/!(*)’. 
*— 1 


and  T*(s)  is  the  ith-order  shifted  Chebyshev  polynomial.  The  Hessian  matrix  is 
full. 

Example  6.  Watson  (Brent,  1973) 


This  problem  was  included  to  demonstrate  the  poor  convergence  of  conjugate- 
gradient  type  methods  on  mildly  ill-conditioned  problems  whose  eigenvalues  are 
uniformly  distributed.  Only  the  value  n  — ■  6  was  tested,  the  condition  number  of 
the  Hessian  matrix  being  approximately  8.6  X  10*. 

Example  7.  GenHose 

This  function  is  a  generalisation  of  the  well-known  two-dimensional  Rosenbrock 
function  (Rosenbrock,  1960). 


F(x)  -  I  +  £(100(*  -  !?_,)’  +  (1  -  *)*). 

•-1 


Our  implementation  of  this  function  differs  from  most  others  in  that  F(x)  is  unity 
at  the  solution  rather  than  xero.  This  modification  ensures  that  the  function  can¬ 
not  be  computed  with  unusually  high  accuracy  at  the  solution  and  is  therefore 
more  typical  of  practical  problems  (for  a  more  detailed  discussion  of  this  point, 
the  reader  is  referred  to  Gill  and  Murray,  1979). 
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The  next  four  examples  arise  from  the  discretisation  of  problems  in  the  cal¬ 
culus  of  variations.  Similar  problems  arise  in  numerical  solution  of  optimal  control 
problems.  The  general  continuous  problem  is  to  find  the  minimum  of  the  functional 

./(AO)  -  l  A‘. «(<).*'(<))<“. 

over  the  set  of  piecewise  differentiable  curves  with  the  boundary  conditions  x(0)  = 
a,  x(l)  —  6.  If  x(f)  is  expressed  as  a  linear  sum  of  functions  that  span  the  space  of 
piecewise  cubic  polynomials  then  minimization  of  J  becomes  a  finite-dimensional 
problem  with  a  tri-diagonal  Hessian  matrix. 

Example  8.  Calvarl  (Gill  and  Murray,  1973) 

./(AO)  -  J‘  {.(i)1 + ^(i)  f»-‘  A‘)  ->««(>+ *'<0,)‘}  <“■ 

with  the  boundary  conditions  x(0)  —  1,  x(l)  —  2. 

Example  9.  Calvar2 

./(AO)  -  I'  |l00(,(t)  -  x'(t)>)*  +  (1  - 

with  the  boundary  conditions  x(0)  ■■  x(l) «—  0. 

Example  10.  Calvar3  (Gill  and  Murray,  1973) 

4A0)  -  /’  {♦-‘•“’W  -  ')}<“■ 

with  the  boundary  conditions  x(0)  -■  1,  x(l)  —  0. 

Example  11.  Var(X)  (Toint,  1C78) 

•/(A0)-/{*'(0,+>*«'<0}<“. 

with  the  boundary  conditions  x(0)  x(l)  —  0. 

This  function  is  discretized  using  piecewise  linear  functions  instead  ofpiccew'ise 
cubics.  Like  the  other  discretized  problems,  the  Hessian  is  tri-diagonal.  The 
problem  is  quadratic  for  X  —  0. 
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Example  12.  QOR  (Toint,  1978) 


n*)-L°^+X>('''-  E  *>+  E  *<)  ■ 

i.|  »— I  '  >€A(»)  >€fl(0  ' 

where  the  constanU  o„  A,  A  and  >eU  A(»)  and  fl(.)  are  the  same  as  those  used  in 
example  PSP.  This  function  is  convex  with  a  sparse  Hessian  matrix. 

Example  13.  GOR  (Toint,  1978) 

M  33 

F[x)  —  Ci(xi)  +  A(v»)i 


where 


(o,x,lof,(l  +*»).  *»>°. 

W)  “  \_o«  log^l  +  x<),  *  <  0, 


Vi  —  **—  2  *i  +  2  *3 
>€A(0  >€«(•) 


v./AyJk^O  +  Wi),  v.^0, 

Mw)  W.  »<° 

Again  the  constanU  a„  A,  A  nnd  «u  *(»)  and  fl(i)  are  defined  as  in  Example  PSP. 
This  function  is  convex  but  there  are  discontinuities  in  the  second  derivatives. 

Example  14.  ChnRose  (Toint,  1978) 

n*)  -  * + £  (<*(*-»  -  *!)J + o  -  *)*)■ 

i-i 

where  the  constanU  a<  are  those  used  in  the  example  PSP.  The  value  of  F(x)  at  the 
solution  has  been  modified  as  in  Example  7.  The  Hessian  matrix  is  tri-diagonal. 


7.4  Starting  poinU 

The  starting  poinU  used  were  the  following. 
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Start  1 


Start  2 


Start  3 

Start  4 


Start  5 


Start  6 


*o  *™  (0»  0»  •  •  •  1 0)  • 

(—  —  —)T 
Vn+l,n+r'‘  ,n+i;  ' 

*0 -(1,-1, 1,-1,  ...)T. 


*o 


.6 

n+T 


x0  —  (-1,-1 . —  1)T. 


7.5  Discussion  of  the  results 

All  the  algorithms  are  coded  in  double- precision  Fortran  IV.  The  runs  were 
made  on  an  IBM  370/168,  for  which  the  relative  machine  precision,  c,  is  ap¬ 
proximately  10~15. 

.  Each  problem  was  solved  using  three  values  of  q,  the  step-length  accuracy. 
These  values  were  0.25,  0.1  and  0.001.  Each  algorithm  requires  two  additional 
user-specified  parameters:  the  bound  upon  the  change  in  x  at  each  iteration, 
(the  quantity  ||x*-^i  — x*||j)  and  Ftl j,  an  estimate  of  the  value  of  the  objective 
function  at  the  solution.  The  value  of  X  was  set  at  10s  for  all  problems  except 
Penl,  Pen2  and  Chebyquad  where  it  was  set  at  10  to  avoid  overflow  during  the 
computation  of  the  objective  function.  In  every  case  F,ti  was  set  to  the  value  of 
F(x)  at  the  solution. 

The  full  set  of  resultsarecontained  in  Tables  Al-A8of  the  appendix.  Table  A 1 
contains  the  number  of  function  evaluations  required  by  conjugate-gradient  type 
methods  on  problems  whose  Hessian  matrices  have  clustered  eigenvalues.  Table 
A2  gives  equivalent  results  for  limited-memory  quasi-Newton  methods.  Tables  A3 
and  A4  give  the  number  of  function  evaluations  required  by  conjugate-gradient 
methods  and  limited-memory  quasi-Newton  methods  on  general  problems. 

In  order  to  limit  the  computing  costs,  an  upper  bound  was  placed  upon  the 
number  of  function  evaluations  made  during  each  minimisation.  In  all  cases  where 
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this  bound  was  achieved  the  function  was  still  being  reduced  (albeit  slowly!).  The 
bound  varied  from  problem  to  problem,  but  in  most  cases  where  the  bound  was 
achieved,  the  progress  of  the  minimization  was  sufficiently  slow  that  considerably 
more  function  evaluations  would  have  been  required  to  obtain  any  meaningful 
accuracy.  An  upper  limit  of  2000  function  evaluations  was  placed  on  all  problems 
except  Watson,  for  which  the  limit  was  700.  The  large  value  of  the  limit  (relative 
to  the  size  of  n)  for  Watson  reflected  the  fact  that  the  objective  function  is  inex¬ 
pensive  to  compute.  Two  algorithms,  PBCG  and  PLMA,  successfully  solved  all 
the  test  problems. 

It  is  helpful  if  the  results  of  the  appendix  are  summarized  so  that  the  methods 
may  be  easily  “compared".  Tables  1  and  2  contain  the  total  number  of  function 
evaluations  required  by  each  method  on  the  two  classes  of  test  example.  Clearly 
the  reader  should  be  wary  of  using  these  tables  as  the  only  means  of  comparison 
since  the  totals  will  depend  upon  the  rj-values  used  and  the  limit  on  the  total 
number  of  function  evaluations.  However,  we  believe  that  Tables  1  and  2  serve  as 
a  useful  introduction  to  the  tables  given  in  the  appendix. 

The  results  indicate  that  the  performance  of  conjugate-gradient  type  methods 
may  be  erratic.  For  example,  quasi-Newton  methods  almost  invariably  have  the 
property  that  the  number  of  iterations  decreases  monotonically  as  the  step-length 
parameter  *7  is  reduced  to  zero.  This  implies  that  the  “best"  value  of  q  determined 
by  averaging  the  results  for  a  large  set  of  test  problems  will  be  close  to  the  “best” 
17  for  each  individual  problem.  However,  for  conjugate-gradient  type  methods  the 
variation  in  the  number  of  iterations  as  tj  changes  is  often  far  from  uniform.  (In 
this  case  a  user  may  find  it  worthwhile  to  experiment  with  the  choice  of  r\  if  a 
large  number  cf  similar  problems  are  being  solved.) 

Table  1 

Total  Number  of  Function  Evaluations  Required 
to  Solve  Problems  With  Clustered  Eigenvalues 


METHOD 

evaluations 

CG 

2189 

BCG 

1922 

PCG 

2269 

PBCG 

2177 

Shanno 

1858 

PLM1 

2062 

PLM2 

2085 

PLMA 

1948 
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This  erratic  behavior  makes  it  very  difficult  to  draw  firm  conclusions  about  whether 
one  algorithm  is  any  better  than  another.  However,  there  is  one  aspect  concern¬ 
ing  the  implementation  of  the  methods  on  which  a  firm  statement  can  be  made. 
The  implementations  of  all  the  algorithms  discussed  in  this  report  incorporate  a 
sophisticated  routine  to  compute  the  step  length.  We  believe  that  such  a  routine 
is  vital  for  both  efficiency  and  robustness. 

The  results  of  Tables  A1-A4  indicate  that  the  traditional  conjugate-gradient 
method  and  Beale's  method  are  very  effective  for  problems  whose  Hessian  matrix 
has  clustered  eigenvalues.  However,  these  methods  may  require  a  prohibitively 
large  number  of  function  evaluations  for  general  problems.  We  believe  that  the 
reputation  of  the  traditional  conjugate-gradient  method  for  unreliability  partly 
stems  from  the  use  of  an  inadequate  6tep-)ength  routine.  The  results  reported 
here  are  quite  acceptable  but  may  represent  the  best  that  can  be  achieved  with 
an  unmodi5ed  algorithm. 

We  would  expect  preconditioning  to  have  a  negative  effect  upon  the  rate 
of  convergence  for  problems  whose  Hessian  matrix  already  has  sets  of  clustered 
eigenvalues.  However,  the  results  of  Table  1  show  that  the  degradation  in  perfor¬ 
mance  on  “naturally"  preconditioned  problems  is  not  serious.  Moreover,  as  Table 
2  shows,  the  improvement  in  performance  on  general  problems  is  quite  dramatic 
overall.  For  this  reason  we  believe  diagonal  preconditioning  to  be  well  worthwhile. 

Table  2 

Total  Number  of  Function  Evaluations  Required 
to  Solve  General  Problems 


METHOD 

EVALUATIONS 

CG 

32820 

BCG 

28588 

PCG 

18597 

PBCG 

14714 

Shanno 

24434 

PLM1 

17188 

PLM2 

16934 

PLMA 

13985 

A  feature  of  methods  which  use  preconditioning  is  that  the  property  of  n- 
step  termination  is  lost.  (In  Section  5  this  property  was  shown  to  hold  numerically 
only  if  the  Hessian  matrix  is  well-conditioned.)  A  comparison  of  the  results  for  the 
well-conditioned  quadratic  function  Var(O)  shows  that  preconditioning  may  result 
in  three  times  the  number  of  function  evaluations  required  by  an  algorithm  that 


hat  rv-step  termination.  However,  this  ratio  is  reduced  significantly  when  the  near- 
quadratic  function  Var(l)  is  minimized. 

We  were  surprised  that  the  two-step  BFGS  formula  appeared  to  give  little 
advantage  over  the  one-step  BFGS  formula.  For  this  reason  it  is  not  clear  that 
increasing  the  number  of  updates  will  necessarily  improve  the  performance  of  a 
limited-memory  quasi-Newton  method.  This  point  is  emphasised  on  the  numerous 
occasions  that  the  limited-memory  method  PLMA  out-performs  the  full  quasi- 
Newton  method. 

The  most  successful  method  was  PLMA,  in  terms  of  the  total  number  of  func¬ 
tion  evaluations  required  to  solve  this  set  of  test  problems.  In  a  direct  comparison 
with  any  single  alternative  routine,  the  number  of  problems  on  which  PLMA  was 
more  successful  was  noticeably  greater  than  the  number  on  which  it  was  less  suc¬ 
cessful.  Thus  the  additional  storage  required  to  implement  a  more  sophisticated 
routine  6eems  to  be  warranted  on  the  grounds  of  both  improved  efficiency  and 
robustness. 

Tables  A5  and  A8  give  a  comparison  of  function  evaluations  for  PLMA  and 
the  two  methods  designed  for  dense  problems.  These  results  are  summarized  in 
Tables  3  and  4.  Tables  A7  and  A8  give  a  comparison  of  iterations  for  the  same 
problems.  (The  reader  should  note  that  the  values  of  rj  used  tor  these  comparisons 
are  not  optimal  for  the  methods  MNM  and  QNM.) 


Table  3 

Total  Number  of  Function  Evaluations  Required 
to  Solve  Problems  With  Clustered  Eigenvalues 


METHOD 

EVALUATIONS 

PLMA 

703 

QNM 

1738 

MNM 

274 

Table  4 

Total  Number  of  Function  Evaluations  Required 
to  Solve  General  Problems 


METHOD 

EVALUATIONS 

PLMA 

5505 

QNM 

3786 

MNM 

1819 

37 


The  comparison  of  PLMA  with  MNM  and  QNM  was  not  intended  to  be  an  ex¬ 
haustive  one,  but  merely  to  demonstrate  the  efficiency  of  methods  specifically 
designed  for  large-scale  problems  compared  to  general  methods.  As  we  might  ex¬ 
pect,  on  problems  with  clustered  eigenvalues  PLMA  does  relatively  well,  especially 
compared  to  QNM.  The  overall  performance  of  PLMA  compared  to  that  of  QNM 
is  also  respectable.  Perhaps  the  most  noticeable  feature  of  th:6  set  of  results  is  the 
often  spectacular  performance  of  the  modified  Newton  method  MNM. 

8.  Summary 

An  algorithm  for  large-scale  optimization  based  upon  preconditioning  a  two- 
step  BFGS  limited-memory  quasi-Newton  method  has  been  suggested.  An  im¬ 
plementation  of  the  method  has  been  shown  to  be  efficient  on  a  selection  of  large 
test  problems.  We  have  shown  that  the  performance  of  this  and  other  conjugate- 
gradient  algorithms  varies  significantly  with  the  accuracy  to  which  the  step  length 
is  computed  and  the  type  of  problem  being  solved.  This  suggests  that  a  user  should 
experiment  with  the  step-lrngth  parameter  when  solving  many  similar  problems. 

A  key  point  when  solving  problems  by  a  conjugate-gradient  algorithm  is  not 
to  attempt  to  find  too  accurate  a  solution.  The  rate  of  convergence  for  all  the 
methods  is  effectively  linear.  On  a  machine  with  a  t  decimal  digit  mantissa,  we 
suggest  termination  when  F(x*)  is  estimated  to  have,  at  most,  min{f/2, 5}  digits  of 
accuracy.  For  many  applications  this  approximate  solution  is  more  than  adequate, 
but  the  low  level  of  accuracy  may  imply  that  we  are  unable  to  determine  that  a 
correct  solution  has  been  found. 

In  order  to  solve  very  large  problems,  computing  restrictions  may  require  the 
number  of  iterations  to  be  a  small  multiple  of  n.  It  should  be  noted  that,  even 
with  the  optimal  value  of  the  step-length  accuracy,  the  method  judged  to  be  the 
best  of  those  tested  often  failed  to  achieve  this  objective. 
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APPENDIX 


TABLE  A1 

Number  of  Function  Evaluations  Required 
by  Conjugate-gradient  Type  Methods 
on  Problems  With  Clustered  Eigenvalues 


PROBLEM 

i  1 

CG 

BCG 

PCG 

PBCG 

37 

46 

44 

44 

Pen  1 

Start  3 

n  -■  50 

KBl  - 

27 

27 

31 

31 

m 

32 

32 

38 

38 

■a 

42 

42 

Penl 

Start  3 

a 

I 

O 

O 

9 

9 

11 

11 

worn 

9 

9 

11 

11 

naa 

19 

19 

82 

81 

Pen] 

Start  2 

n  ■■  50 

■  B*M 

25 

32 

54 

57 

■SSI  1 

33 

33 

62 

50 

wnwm 

17 

17 

62 

67 

Penl 

Start  2 

n-  100 

IBfl 

45 

45 

84 

72 

■m 

58 

58 

117 

125 

■aa 

22 

22 

43 

58 

Pen2 

Start  5 

n  -  50 

24 

24 

45 

44 

■KB 

44 

46 

67 

124 

ma 

15 

15 

11 

11 

Pen  2 

Start  5 

n  »  100 

IBfl 

13 

13 

12 

12 

mm 

21 

19 

18 

18 

KEY 

CG  -  traditional  conjugate-gradient  method. 

PCG  •  algorithm  CG  with  diagonal  preconditioning. 
BCG  -  Beale's  method  with  Powell  restarts. 

PBCG  •  diagonally  preconditioned  Beale's  method. 
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TABLE  A1  (continued) 

Number  of  Function  Evaluation*  Required 
by  Conjugate-gradient  Type  Method* 
on  Problem*  With  Clustered  Eigenvalue* 


PROBLEM 

_ 5 _ 

CG 

BCG 

PCG 

PBCG 

-  ■  - 

na 

99 

105 

52 

65 

Pen2 

Start  3 

n 

50 

90 

94 

74 

74 

mam 

121 

121 

106 

106 

na 

17 

20 

14 

14 

Pen2 

Start  3 

n 

— 

100 

13 

19 

17 

17 

m 

28 

32 

30 

30 

■  ...  - 

■an 

97 

79 

64 

71 

Pen3 

Start  1 

n 

_ 

50 

94 

84 

77 

57 

mam 

119 

89 

94 

73 

|g| 

77 

70 

74 

Pen  3 

Start  1 

n 

100 

74 

76 

74 

■M 

92 

132 

99 

■aa 

123 

69 

82 

76 

Pen  3 

Start  3 

n 

= 

50 

117 

75 

90 

76 

■EM 

99 

85 

117 

95 

ma 

97 

87 

80 

83 

Pen3 

Start  3 

n 

— 

100 

I'gj 

108 

98 

109 

62 

n 

112 

97 

132 

96 

| 

KUEi 

5 

5 

5 

5 

PSP 

Start  1 

B 

— 

50 

KtSfl 

7 

7 

7 

7 

■ 

n 

7 

7 

7 

7 

KEY 

CG  -  traditional  conjugate-gradient  method. 

PCG  •  algorithm  CG  with  diagonal  preconditioning. 
BCG  -  Beale'*  method  with  Powell  restart*. 

PBCG  -  diagonally  preconditioned  Beale'*  method. 


TABLE  A2 

Number  of  Function  Evaluation*  Required  By 
Limited-memory  Quasi-Newton  Methods 
on  Problems  With  Clustered  Eigenvalues 


PROBLEM 

_ 2 _ 

Sh*nno  PLM1 

PL  M3 

PLMA 

| 

■ 

■  | 

ica 

27 

53 

53 

53 

Penl 

Start  3 

■fin 

37 

28 

29 

27 

■ 

■ 

■ 

■BSB 1 

49 

34 

33 

32 

tea 

28 

46 

44 

40 

Penl 

Start  3 

n 

— 

100 

13 

10 

10 

10 

mam 

14 

10 

10 

10 

3 

1 

BSI 

21 

21 

20 

21 

Penl 

Start  2 

n 

s 

1 

IBi 

23 

28 

28 

32 

i 

m 

■M 

25 

23 

28 

29 

■BJ  1 

19 

23 

20 

Penl 

Start  2 

n 

— 

100 

IBM 

45 

41 

41 

44 

m  i 

39 

48 

48 

57 

I 

ica 

15 

35 

51 

JLM 

Pen2 

Start  5 

50 

ISM 

33 

36 

33 

JEM 

■ 

warn 

42 

67 

77 

WBM 

■ 

ca 

8 

10 

15 

28 

Pen2  * 

Start  5 

B 

— 

100 

14 

12 

13 

13 

■ 

warn 

21 

18 

18 

23 

KEY 

Shanno  -  Shanno's  method. 

PLM1  -  preconditioned  one-step  BFGS. 

PLM2  -  preconditioned  two-step  BFGS. 

PLMA  •  method  PL  M2  with  accumulated  step. 
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TABLE  A2  (continued) 

Number  of  Function  Evaluttiont  Required  by 
Limited-memory  Quasi-Newton  Methods 
on  Problems  With  Clustered  Eigenvalues 


PROBLEM 

3 

Shanno 

PLMl 

PLM2 

PLMA 

MCE 

.25 

74 

58 

70 

118 

Pen2 

Start  3 

n  ■■  50 

B 

.1 

99 

80 

68 

76 

■X 

.001 

127 

106 

82 

71 

B 

.25 

17 

12 

16 

28 

Pen  2 

Sun  3 

n=  100 

.1 

16 

17 

15 

18 

WZ 

.001 

32 

30 

23 

29 

■ 

B 

.25 

sa 

85 

65 

Pen3 

Start  1 

B 

.1 

ml 

90 

62 

MMi 

■X 

.001 

80 

104 

71 

EE 

.25 

89 

115 

95 

77 

Pen  3 

Start  1 

n—  100 

B 

.1 

84 

120 

109 

87 

H 

.001 

81 

132 

94 

79 

■X 

.25 

83 

89 

101 

76 

Pen  3 

Start  3 

n  —  50 

n 

.1 

75 

89 

107 

76 

H 

.001 

85 

117 

116 

71 

■E 

.25 

85 

82 

87 

85 

Pen  3 

Start  3 

n—  100 

B 

.1  . 

79 

84 

111 

94 

H 

.001 

94 

132 

124 

83 

B 

.25 

4 

4 

4 

4 

PSP 

SUrt  1 

n  —  50 

f£ 

.1 

6 

7 

6 

6 

ifl 

.001 

_ L_ 

7 

7 

7 

KEY 

Shanno  •  Shanno’s  method. 

PLM1  -  preconditioned  one-step  BFGS. 

PLM2  -  preconditioned  two-step  BFGS. 

PLMA  -  method  PL  M2  with  accumulated  step. 


TABLE  A3 

Number  of  Function  Evaluation*  Required 
by  Conjugate-gradient  Type  Method* 
on  General  Problem* 


PROBLEM 

_ 3 _ 

CG  BCG  PCG  PBCG 

Chebyquad  Start  2  n  —  6 

wSmmSmmmSM 

Chebyquad  Start  2  n  -■  8 

31  22  24  26 

31  30  26  27 

32  34  34  37 

Chebyquad  Start  2  n  -»  20 

109  85  89  79 

98  119  85  81 

138  132  94  86 

Wataon  Start  1  n  —  6 

558  135  >700  546 

456  81  >700  200 

470  161  >700  609 

— 

GenRoae  Start  2  n  — •  50 

652  551  216  190 

667  599  219  202 

748  664  269  250 

GenRoae  Start  2  n*  100 

KSKBESSI 

Calvarl  Start  1  n  —  50 

>2000  >2000  346  427 

>2000  >2000  354  425 

>2000  >2000  419  433 

Calvarl  Start  1  n  —  100 

>2000  >2000  815  828 

>2000  >2000  821  857 

>2000  >2000  929  985 

KEY 

CG  •  traditional  conjugate-gradient  method. 

PCG  •  algorithm  CG  with  diagonal  preconditioning. 
BCG  •  Beale'*  method  with  Powell  restart*. 

PBCG  -  diagonally  preconditioned  Beale'*  method. 
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TABLE  A3  (continued) 

Number  of  Functba  Evaluations  Required 
bj  Conjugate-gradient  Type  Methods 
on  General  Problems 


PROBLEM 

CG 

BCG 

PCG 

PBCG 

■ 

MTXm 

239 

178 

159 

126 

Calvar2 

Start  1 

J?t] 

Ini s- 

239 

191 

164 

129 

mm 

251 

190 

1G8 

124 

warm 

409 

299 

284 

Ca!var2 

Start  1 

n  —  100 

BSfl 

349 

305 

227 

IBS 

371 

312 

238 

709 

162 

157 

Calvar3 

Start  1 

n  — 50 

BBl 

709 

159 

145 

n 

784 

172 

160 

m 

1903 

1363 

292 

277 

Calvar3 

Start  1 

n—  100 

Kfifl 

1927 

1340 

275 

280 

n 

1695 

1507 

290 

284 

BO 

200 

200 

675 

673 

Var(O) 

Start  4 

n-  100 

IBM 

200 

200 

701 

529 

MESS 

200 

200 

774 

512 

■ m 

150 

130 

322 

315 

Var(l) 

Start  4 

n  —  50 

tSSm 

150 

130 

334 

272 

n 

151 

131 

355 

240 

■  lj  a 

344 

284 

638 

639 

V«(1) 

Start  4 

n—  100 

IBM 

344 

284 

665 

555 

n 

347 

289 

717 

544 

KEY 

CG  -  traditional  conjugate-gradient  method. 

PCG  -  algorithm  CG  with  diagonal  preconditioning. 
BCG  •  Beale's  method  with  Powell  restarts. 

PBCG  •  diagonally  preconditioned  Beale's  method. 
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TABLE  A3  (continued) 

Number  of  Function  Evaluation*  Required 
by  Conjufate-gradient  Type  Methods 
on  General  Problems 


PROBLEM 

1 

CG 

BCG 

PCG 

PBCG 

■  1  ■ 

27 

27 

29 

29 

QOR 

Start  1 

tfil 

27 

27 

29 

29 

n 

27 

27 

29 

29 

mjstsm 

87 

81 

72 

71 

GOR 

Start  1 

n  —  50 

■Bl 

87 

81 

76 

76 

in 

108 

99 

95 

92 

81 

82 

74 

57 

ChnRose 

Start  6 

n  — 25 

■CSfl 

84 

78 

82 

63 

IBS 

100 

106 

86 

95 

KEY 

CG  •  traditional  conjufate-fradient  method. 

PCX!  -  algorithm  CG  with  diagonal  preconditioning. 
BCG  -  Beale's  method  with  Powell  restarts. 

PBCG  -  diagonally  preconditioned  Beale's  method. 
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1 


*7 1 


PROBLEM 

Chebjrquad 

Start  2 

n  =  6 

Chebjquad 

Start  2 

n  ■*  8 

Chebyquad 

Start  2 

n  «  20 

Watson 

Start  1 

n  “  6 

GenRose 

Start  2 

n  — 50 

GenRose 

Start  2 

n  —  100 

Calvarl 

Start  1 

n  —  50 

Calvarl 

Start  1 

n  ™  100 

Shtnno 


19 

2 

2 


27 

29 

43 


72 

71 

88 


174 

175 
186 


24 

25 


403 

416 

526 


>2000 

>2000 

>2000 


>2000 

>2000 

>2000 


PLMl 


18 
22 
27 


26 

27 

34 


80 

95 

87 


>700 

>700 

>700 


199 

210 

261 


330 

348 

480 


451 

464 

478 


841 

951 

992 


PLM2  PLMA 


28 

30 

34 


80 

86 

83 


>700 

>700 

>700 


203 

207 

271 


328 

361 

478 


428 

406 

457 


879 

904 

1025 


21 

21 

26 


75 

71 

90 


294 

406 

316 


201 

263 

330 


3G5 

410 

528 


366 

401 

456 


819 

854 

905 
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TABLE  A4  (continued) 

Number  of  Function  Evaluations  Required  by 
Limited-memory  Quasi-Newton  Methods 
on  General  Problems 


PROBLEM 

V 

Shftnno 

PLMl 

PLM2 

PLMA 

rj  —  .25 

382 

168 

127 

106 

Calvar2 

Start  1 

n 

50 

TJ  — .1 

275 

159 

134 

118 

f)  -  .001 

192 

168 

158 

123 

rj  —  .25 

690 

253 

351 

204 

Ca!var2 

Start  1 

n 

100 

q  — .1 

548 

308 

310 

206 

TJ-.001 

381 

328 

330 

228 

rj  -.25 

1040 

162 

165 

152 

Calvar3 

Start  1 

n 

_ 

50 

q  —  .1 

861 

172 

170 

155 

T]  •—  .001 

611 

186 

178 

161 

q-,25 

1891 

308 

308 

270 

Calvar3 

Start  1 

n 

mm 

100 

r>  —  .1 

>2000 

304 

306 

281 

rj  -  .001 

1322 

318 

314 

284 

.25 

463 

614 

603 

475 

Var(O) 

Start  4 

n 

— 

100 

r;  —  .1 

413 

762 

746 

494 

r,-  .001 

201 

787 

783 

579 

rj  “  .25 

184 

276 

321 

199 

Var(l) 

Start  4 

n 

— 

50 

^-.1 

139 

358 

303 

224 

r)  =*  .001 

128 

347 

361 

261 

rj  =  .25 

398 

664 

545 

497 

Var(l) 

Start  4 

n 

100 

»j  — .1 

412 

690 

654 

534 

TJ  ■■  .001 

288 

720 

713 

547 

KEY 

Shanno  -  Shanno's  method. 

PLM1  -  preconditioned  one-step  BFGS. 

PLM2  •  preconditioned  two-step  BFGS. 

PLMA  •  method  PLM2  with  accumulated  step. 
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TABLE  A4  (continued) 

Number  of  Functioo  Evaluation*  Required  by 
Limited-memory  Quasi-Newton  Method* 
on  General  Problems 


PROBLEM 

— 

_ 1 _ 

Shtnoo  PLMI 

PLM2 

PLMA 

naa 

28 

29 

29 

29 

QOR 

Start  1 

n  -  50 

tCHl 

27 

29 

27 

29 

n 

27 

29 

27 

29 

ioa 

73 

73 

79 

71 

GOR 

Start  1 

n  -  50 

Ism 

78 

81 

77 

76 

HO 

94 

95 

94 

97 

■ya 

60 

69 

57 

82 

ChnRose 

Start  6 

n  —  25 

■BJ  ■ , 

82 

102 

103 

76 

n 

95 

108 

116 

119 

KEY 

Shanno  •  Shanno's  method. 

PLMi  -  preconditioned  one-step  BFGS. 

PLM2  -  preconditioned  two-step  BFGS. 

PLMA  -  method  PLM2  with  accumulated  step. 


TABLE  A5 

Number  of  Function  Evaluations  Required 
by  Preconditioned  Limited-Memory 
Methods  v«  Modified-Newton  and 
Full  Quaai-Newton  Methods 
on  Problems  With  Clustered  Eigenvalues 


PROBLEM 

_ _ 5 _ 

PLMA 

MNM 

QNM 

mai 

53 

18 

33 

Penl 

Start  3 

n  —  50 

■SI 

27 

25 

26 

H 

32 

29 

31 

mi 

40 

NR 

NR 

Penl 

Start  3 

n-  100 

isi 

10 

NR 

NR 

HI 

10 

NR 

NR 

EB9 

21 

11 

22 

Penl 

Start  2 

ESI 

32 

16 

30 

■m 

29 

18 

22 

iua 

20 

NR 

NR 

Penl 

Start  2 

n—  100 

!»■ 

44 

NR 

NR 

BHi 

57 

NR 

NR 

im 

67 

15 

214 

Pen2 

Start  5 

47 

18 

239 

■m 

112 

19 

290 

■m 

23 

NR 

NR 

Pen2 

Start  5 

n-  100 

13 

NR 

NR 

■H9 

23 

NR 

NR 

iua 

28 

NR 

NR 

Pen2 

Start  5 

n  —  100 

tSl 

13 

NR 

NR 

HI 

23 

NR 

NR 

KEY 

PLMA  -  preconditioned  two-step  BFGS  with  accumulated  step. 
MNM  -  modified  Newton  method. 

QNM  •  quasi-Newton  method. 

NR  -  Not  run. 
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TABLE  A5  (continued) 

Number  of  F unction  Evslustioni  Required 
by  Preconditioned  Limited-Memory 
Methods  vs  Modified-Newton  and 
Full  Quasi-Newton  Methods 
on  Problems  With  Clustered  Eigenvalues 


PROBLEM 

V 

PLMA 

MNM 

QNM 

q  “  .25 

118 

17 

242 

Pen2 

Start  3 

n  —  50 

q-.l 

76 

31 

322 

X]  -  .001 

71 

26 

341 

q  ™  .25 

28 

NR 

NR 

Pen2 

Start  3 

n  =  100 

V  —  l 

18 

NR 

NR 

x]  —  .001 

29 

NR 

NR 

q  —  .25 

65 

14 

115 

Pen  3 

Start  1 

n  50 

q  —  .1 

62 

20 

113 

q  —  .001 

71 

24 

146 

q  -  .25 

77 

NR 

NR 

Pen  3 

Start  1 

n—  100 

q-i 

87 

NR 

NR 

C3 

1 

'o 

o 

79 

NR 

NR 

q  -  .25 

76 

44 

135 

Pen  3 

Start  3 

n  50 

q  —  l 

76 

44 

150 

o 

o 

1 

c- 

71 

48 

155 

q-.25 

85 

NR 

NR 

Pen  3 

Start  3 

n  -  100 

q  —  l 

94 

NR 

NR 

q  =  .001 

83 

NR 

NR 

q-  .25 

4 

2 

5 

PSP 

Start  1 

n  -  50 

q  —  -1 

6 

2 

7  1 

q  -  .001 

7 

2 

7. _ 1 

KEY 

PLMA  -  preconditioned  two-step  BFGS  with  accumulated  step. 
MNM  •  modified  Newton  method. 

QNM  -  quasi-Newton  method. 

NR  -  Not  run. 


50 


TABLE  A6 

Number  of  Function  Evaluations  Required 
by  Preconditioned  Limited-Memory 
Method*  vs  Modified-Newton  and 
Full  Quasi-Newton  Methods 
on  General  Problems 


PROBLEM 

V 

PLMA 

MNM 

QNM 

■a a 

17 

18 

13 

Chebyquad 

Start  2 

n  6 

18 

39 

15 

■BO 

26 

56 

19 

■m 

21 

38 

21 

Chebyquad 

Start  2 

n  —  8 

KfiBl 

21 

64 

25 

* 

■Bfl 

26 

68 

36 

■sa 

75 

121 

65 

Chebyquad 

Start  2 

n-20 

71 

116 

67 

■BO 

90 

161 

92 

■  LSI 

294 

11 

28 

Watson 

Start  1 

n  ■=  8 

■si 

15 

37 

RCE9 

316 

27 

55 

ma 

201 

202 

287 

GenRose 

Start  2 

n  ■*  50 

257 

323 

■HSO 

392 

412 

muim 

365 

NR 

NR 

GenRose 

Start  2 

n  —  100 

HM 

410 

NR 

NR 

MO 

528 

.. 

NR 

NR 

■  m 

368 

9 

191 

Calvarl 

Start  1 

n  50 

HH 

401 

il 

214 

mom 

456 

17 

269 

mssa 

819 

NR 

NR 

Calvarl 

Start  1 

n-  100 

854 

NR 

NR 

— i 

905 

NR 

NR 

KEY 

PLMA  -  preconditioned  two-step  BFGS  with  accumulated  step. 
MNM  -  modified  Newton  method. 

QNM  -  quasi-Newton  method. 

NR  -  Not  run. 
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TABLE  A6  (continued) 

Number  of  Function  Evaluation*  Required 
by  Preconditioned  Limited-Memory 
Metboda  v*  Modified-Newton  and 
Full  Quasi-Newton  Method* 
on  General  Problem* 


PROBLEM 

V 

PLMA 

MNM 

QNM 

29 

3 

39 

QOR 

Start  1 

n  —  50 

■sa 

29 

3 

27 

jLaJ 

29 

3 

27 

mam 

71 

5 

59 

GOR 

Start  1 

n-  50 

KBI 

76 

5 

59 

■m 

97 

7 

72 

82 

28 

97 

ChnRoae 

Start  6 

n  —  25 

Ism 

76 

48 

122 

BBSS 

119 

47 

164 

KEY 

PLMA  -  preconditioned  two-*tep  BFGS  with  accumulated  *tep. 
MNM  -  modified  Newton  method. 

QNM  -  quaai-Newton  method. 

NR  -  Not  run. 


S3 


TABLE  A7 

Number  of  Iterations  Required 
by  Preconditioned  Limited-Memory 
Methods  vs  Modified-Newton  and 
Full  Quasi-Newton  Methods 
on  Problems  With  Clustered  Eigenvalues 


PROBLEM 

PLMA 

MNM 

. <*NM— . 

V  *» 

.25 

22 

17 

27 

Penl 

Start  3 

n  =  50 

T]  — 

.1 

8 

9 

8 

q- 

.001 

8 

7 

8 

■X 

.25 

17 

NR 

Ml 

Penl 

Start  3 

n  «=  100 

■E 

.1 

NR 

NR 

IE 

.001 

NR 

NR 

■X 

.25 

8 

11 

17 

Penl 

Start  2 

n  «■  50 

M 

•1 

11 

6 

10 

KH 

.001 

7 

5 

5 

T)  — 

.25 

9 

NR 

NR 

Penl 

Start  2 

n  100 

v  — 

.1 

13 

NR 

NR 

r,  - 

.001 

14 

NR 

NR 

.25 

31 

14 

98 

Pen2 

Start  5 

n  =■*  50 

V  — 

.1 

15 

6 

72 

TJ  = 

.001 

27 

5 

G2 

rj  — 

.25 

13 

NR 

NR 

Pen2 

Start  5 

n  *  100 

V  **= 

.1 

6 

NR 

NR 

. 

n  — 

.001 

6 

NR 

NR 

KEY 

PLMA  -  preconditioned  two-step  BFGS  with  accumulated  step. 
MNM  -  modified  Newton  method. 

QNM  -  quasi-Newton  method. 

NR  -  Not  run. 
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TABLE  A7  (continued) 

Number  of  Iterations  Required 
by  Preconditioned  Limited-Memory 
Method*  vt  Modified-Newton  and 
Full  Quaai-Newton  Method* 
on  Problem*  With  Cluttered  Eigenvalue* 


PROBLEM 

q 

PLMA  MNM  QNM 

Pen2  Start  3  n  —  50 

q-.25 
q  —  .1 
q  -  .001 

52  17  134 

28  9  99 

15  6  73 

Pen2  Start  3  n  «■  100 

q  ■*  .25 
q  =»  .1 
q  —  .001 

14  NR  NR 

7  NR  NR 

7  NR  NR 

Pen3  Start  1  n  —  50 

q-  .25 
q  —  l 

q  -  .001 

35  10  57 

30  8  51 

28  8  54 

Pen3  Start  1  n  —  100 

q  —  .25 
q  —  1 

q  —  .001 

39  NR  NR 

44  NR  NR 

34  NR  NR 

Pen3  Start  3  n  «  50 

q-.25 
q  —  .1 
q  -  .001 

40  40  67 

38  12  63 

28  11  56 

Pen 3  Start  3  n  —  100 

q-  .25 
q  —  .1 
q-  .001 

49  NR  NR 

48  NR  NR 

35  NR  NR 

PSP  Start  1  n  —  50 

q-  .25 
q  —  .1 
q  -  .001 

3  2  4 

3  2  3 

3  2  3 

KEY 

PLMA  -  preconditioned  two-*tep  BFGS  with  accumulated  step. 
MNM  -  modified  Newton  method. 

QNM  -  quaai-Newton  method. 

NR  -  Not  run. 
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TABLE  A8 

Number  of  limtioni  Required 
by  Preconditioned  Limited-Memory 
Methods  vs  Modified-Newton  and 
Full  Quasi-Newton  Methods 
on  General  Problems 


PROBLEM 

r - 

PLMA 

MNM 

QNM 

— 

.25 

8 

4 

8 

Chebyquad 

Start  2 

n  =  6 

KE: 

.1 

4 

8 

BE 

.001 

C 

mm 

EE 

.25 

G 

Chebyquad 

Start  2 

n  *=  8 

.1 

8 

ra 

wt 

.001 

10 

ra 

V  — 

.25 

38 

29 

32 

Chebyquad 

Start  2 

o 

0* 

B 

e 

1  — 

.1 

33 

24 

28 

V  — 

33 

28 

♦  — 

.25 

11 

25 

Watson 

Start  1 

n  —  6 

J]  — 

.1 

188 

7 

18 

V  “ 

129 

8 

I® _ 

n  “ 

.25 

128 

GenRose 

Start  2 

n  ■■  50 

»?  — 

.1 

119 

GG 

118 

119 

88 

118 

.25 

191 

NR 

NR 

GenRoae 

Start  2 

n  «=  100 

q  = 

•1 

192 

NR 

NR  | 

n  — 

188 

NR 

NR 

EE 

.25 

194 

7 

1G2 

Calvarl 

Start  1 

n  — •  50 

IE 

.1 

G 

89 

BH 

.001 

205 

6 

88 

»?  — 

.25 

423 

NR 

NR 

Calvarl 

Start  1 

n-  100 

»?  — 

.1 

429 

NR 

NR 

7" 

.001 

416 

NR 

NR 

KEY 

PLMA  -  preconditioned  two-step  BFGS  with  accumulated  step. 
MNM  -  modified  Newton  method. 

QNM  •  quasi-Newton  method. 

NR  -  Not  run. 
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TABLE  A8  (continued) 
Number  of  iterations  Required 
by  Preconditioned  Limited-Memory 
Methods  vs  Modified-Newton  and 
Full  Quasi-Newton  Methods 
on  General  Problems 


PROBLEM 

7 

PLMA 

MNM 

QNM 

7  — 

.25 

64 

4 

28 

Caivar2 

Start  1 

n  —  50 

7  — 

.1 

61 

4 

28 

n  — 

.001 

60 

4 

28 

7  «= 

.25 

112 

NR 

NR 

Calvar2 

Start  1 

n-  100 

7  — 

.1 

107 

NR 

NR 

v  — 

.001 

113 

NR 

NR 

7  — 

.25 

80 

6 

90 

Calvar3 

Start  1 

n  — 50 

7  — 

.1 

78 

5 

59 

7  — 

.001 

77 

5 

59 

n  — 

.25 

143 

NR 

NR 

Calvar3 

Start  1 

n  —  100 

7  — 

.1 

142 

NR 

NR 

n- 

.001 

138 

NR 

NR 

7  — 

.25 

266 

NR 

NR 

Var(O) 

Start  4 

n-  100 

q  *= 

.1 

255 

NR 

Nil 

.001 

289 

NR 

NR 

7  — 

.25 

115 

3 

52 

Var(l) 

Start  4 

n  -  50 

7  — 

.1 

117 

3 

51 

.001 

129 

3 

51 

*1  — 

.25 

273 

NR 

NR 

Var(l) 

Start  4 

n  —  100 

7  — 

.1 

274 

NR 

NR 

in 

.001 

271 

NR 

NR 

KEY 

PLMA  -  preconditioned  two-step  BFGS  with  accumulated  step. 
MNM  -  modified  Newton  method. 

QNM  -  quasi- New  ton  method. 

NR  -  Not  run. 


TABLE  A8  (continued) 
Number  of  Iterstbat  Required 
by  Preconditioned  Limited-Memory 
Methods  vs  Modified-Newton  and 
Full  Quasi-Newton  Methods 
on  General  Problems 


PROBLEM 

9 

PLMA 

MNM 

QNM 

■BJ  | 

3 

23 

QOR 

Start  1 

n  ■■  50 

3 

13 

■BO 

3 

13 

iuj 

41 

5 

29 

GOR 

Start  1 

n  ■*  50 

41 

5 

29 

■BO 

42 

5 

29 

■oa 

40 

15 

48 

ChnRose 

Start  6 

n  — 25 

KQOfl 

37 

16 

46 

KB9 

43 

12 

47 
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CONJUGATE-GRADIENT  METHODS 

FOR  LARGE-SCALE  NONLINEAR  OPTIMIZATION 

In  this  paper  we  discuss  several  recent  conjugate-gradient  type  methods  for 
solving  large-scale  nonlinear  optimization  problems.  We  demonstrate  how  the 
performance  of  these  methods  can  be  significantly  improved  by  careful  Imple¬ 
mentation.  A  method  based  upon  Iterative  preconditioning  will  be  suggested 
which  performs  reasonably  efficiently  on  a  wide  variety  of  significant 
test  problems. 

Our  results  indicate  that  nonlinear  conjugate-gradient  methods  behave  in  a 
similar  way  to  conjugate-gradient  methods  for  the  solution  of  systems  of 
linear  equations.  These  methods  work  best  on  problems  whose  Hessian  matrices 
have  sets  of  clustered  eigenvalues.  On  more  general  problems,  however, 
even  the  best  method  may  require  a  prohibitively  large  number  of  iterations. 
We  present  numerical  evidence  that  indicates  that  the  use  of  theoretical 
analysis  to  predict  the  performance  of  algorithms  on  general  problems  is 
not  straightforward. 


